100 Real GenAI Engineer Interview Questions
Posted on Wed 03 June 2026 in GenAI
Training & Adaptation Strategy
-
What approaches exist for training or adapting an LLM? — Pretraining, fine-tuning, instruction tuning, prompt engineering, RAG.
-
Base model vs instruction-tuned model? — Pure next-token predictor vs one aligned to follow instructions.
-
When would you choose fine-tuning over RAG? — Stable domain knowledge, style/format control, latency sensitivity.
-
When would you choose RAG over fine-tuning? — Fresh, changing, or large knowledge that shouldn't be baked into weights.
-
When is prompt engineering alone sufficient? — Low-stakes tasks where the model you are prompting already has the capability.
-
What decisions must be made before training an LLM? — Objective, data, budget, eval strategy, base model choice.
-
What trade-offs do you evaluate when picking a training strategy? — Cost, accuracy, latency, maintainability, compliance.
-
How do you ensure business requirements are met during adaptation? — Define success metrics tied to outcomes, not just loss.
-
Can prompt engineering be considered a form of training? — No weight updates; it conditions behavior at inference.
-
When does prompt engineering stop being sufficient? — When accuracy, grounding, or consistency demands retrieval or tuning.
LLM Internals & Behavior
-
What is tokenization and why does it drive cost and latency? — Text is split into tokens; pricing, context limits, and generation time all scale with token count.
-
What is the context window? — All tokens the model sees at once: prompt, history, tools, retrieved docs.
-
What are embeddings and where do you use them? — Dense meaning vectors for search, clustering, retrieval.
-
What causes hallucinations? — Over-generalization, insufficient context, ungrounded generation.
-
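How do temperature and top-p affect output? — Temperature rescales the token distribution (lower is more deterministic); top-p limits sampling to the smallest set of tokens whose cumulative probability reaches p.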
-
What is the difference between greedy decoding and sampling? — Deterministic top token vs probabilistic selection.
-
What are emergent abilities and why do they matter for model selection? — Capabilities that show up only past a certain model scale, so a smaller model may lack a skill the task depends on.
-
Proprietary vs open-weight models — how do you decide? — Performance/turnkey safety vs control, cost, data sovereignty.
-
What is a distilled model and when do you use one? — A smaller model mimicking a larger one for cost/latency.
-
How do you select the right LLM for a given business use case? — Match capability, cost, latency, compliance to requirements.
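The decoding questions above (temperature, top-p, greedy vs sampling) are easier to explain with a toy example. This is a minimal sketch on made-up logits rather than a real model; the function names `softmax`, `top_p_filter`, `greedy`, and `sample` are illustrative and not taken from any library.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sampling: draw a token from the temperature-scaled, top-p-truncated distribution."""
    candidates = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]

toy_logits = [2.0, 1.0, 0.1]  # hypothetical scores for three tokens
print(greedy(toy_logits))
print(sample(toy_logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Greedy decoding returns the same token every time, while `sample` can return different tokens across runs unless the random generator is seeded.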
Prompt Engineering in Practice
-
How does prompt engineering control output behavior? — Constraints, role, examples, and format instructions.
-
How do you use prompting to reduce hallucinations? — Grounding instructions, "say I don't know," retrieved context.
-
How do you enforce structured outputs? — Schema/JSON constraints, function calling, validation.
-
How do you design prompts aligned with business logic? — Encode rules and constraints explicitly and test them.
-
How do you design prompts that respect compliance requirements? — Bake in PII/policy guardrails and refusal conditions.
-
What is chain-of-thought and when do you avoid it? — Eliciting reasoning; avoid when latency/cost or leakage matters.
-
What is few-shot vs zero-shot prompting? — In-context examples vs none.
-
What is self-consistency? — Sampling multiple reasoning paths and voting.
-
How do you defend against prompt injection? — Input separation, sanitization, instruction hierarchy, allowlists.
-
What are stop sequences and prompt templates used for? — Halting generation and standardizing reusable prompts.
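For the structured-output question, one way to make "validation" concrete is a small gate that parses the model's reply and checks it against the fields you asked for. This is a sketch under assumptions: the schema in `REQUIRED_FIELDS` and the sample replies are hypothetical, and in practice you might rely on a schema library or the provider's function-calling feature instead.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "answer": str,
    "confidence": (int, float),
    "sources": list,
}

def validate_output(raw):
    """Parse a model reply as JSON and check required fields; return (data, errors)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            errors.append(f"wrong type for field: {name}")
    return (data if not errors else None), errors

good = '{"answer": "Paris", "confidence": 0.9, "sources": ["doc-1"]}'
bad = '{"answer": "Paris"}'
print(validate_output(good))
print(validate_output(bad))
```

When validation fails, a typical follow-up is to retry with the error list appended to the prompt, or to fall back to a safe default response.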
RAG
-
What is RAG and why use it? — Grounding generation in retrieved documents for freshness and accuracy.
-
How do you evaluate a RAG pipeline? — Assess retrieval and generation separately and jointly.
-
Beyond accuracy, what RAG metrics matter? — Faithfulness, relevance, retrieval quality.
-
How do you reduce hallucinations in a RAG system? — Better retriever, reranking, filtering, constrained decoding.
-
Why do dense retrievers like ColBERT or Contriever help? — Stronger semantic matching, especially fine-tuned on domain data.
-
What is reranking and where does it sit? — Second-stage scoring to weed out low-quality retrieved content.
-
How do you choose chunk size and overlap? — Balance context completeness against retrieval precision and truncation.
-
What is hybrid search? — Combining keyword (sparse) and vector (dense) retrieval.
-
What is the "lost in the middle" problem? — Models underuse information in the middle of long contexts.
-
How do you handle multi-hop questions? — Chained or iterative retrieval across documents.
-
What is metadata filtering and why use it? — Narrowing retrieval using structured attributes.
-
How do hard negatives improve retrieval? — Similar-but-wrong docs sharpen contrastive training.
-
What indexing structures power vector search? — ANN methods like HNSW for scalable similarity search.
-
How does RAG inference differ from a training pipeline? — Real-time retrieval and prompt assembly vs batch weight updates.
-
How do you keep a RAG knowledge base fresh and traceable? — Versioned ingestion, recency handling, source citation.
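Two of the RAG answers above, chunk overlap and hybrid search, can be sketched in a few lines. The chunker below splits on whitespace words, which is an assumption made for brevity (real pipelines usually count tokens). The fusion step uses reciprocal rank fusion, which is one common way to combine sparse and dense rankings, not the only one; the document IDs are made up.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

print(chunk_words("a b c d e f g", size=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g']

keyword_hits = ["d1", "d2", "d3"]  # hypothetical sparse (keyword) ranking
vector_hits = ["d2", "d4", "d1"]   # hypothetical dense (vector) ranking
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['d2', 'd1', 'd4', 'd3']
```

Documents that both retrievers rank highly float to the top, which is the practical appeal of hybrid search; a reranker would then rescore this fused list.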
Fine-tuning & Data
-
Full fine-tuning vs parameter-efficient fine-tuning? — All weights vs a small trainable subset.
-
What is LoRA / QLoRA? — Low-rank adapters, optionally on a quantized base model.
-
What is quantization and what does it cost you? — Lower precision for speed/memory at some accuracy risk.
-
What is RLHF and what is DPO? — Preference alignment via a learned reward model plus reinforcement learning vs optimizing directly on preference pairs with no separate reward model.
-
What is catastrophic forgetting and how do you avoid it? — Loss of prior skills; mitigate with mixed data, adapters.
-
What data do you need to fine-tune effectively? — Sufficient, clean, representative, correctly formatted examples.
-
How do you ensure sensitive data is excluded from fine-tuning sets? — Filtering, masking, provenance checks before training.
-
How do you verify what data actually went into a model? — Dataset versioning, lineage, and audit records.
-
What are the risks of synthetic training data? — Distribution drift, bias amplification, model collapse.
-
When do you use continual or domain-adaptive pretraining? — Large domain corpus that prompting/RAG can't cover.
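The LoRA and quantization answers above come down to simple arithmetic that is worth being able to do on a whiteboard. The sketch below assumes a single 4096 x 4096 weight matrix, rank 8, and a 7-billion-parameter model purely as illustrative numbers; the memory figure covers weights only and ignores activations, optimizer state, and KV cache.

```python
def lora_trainable_params(d_out, d_in, rank):
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full d_out x d_in update."""
    return rank * (d_out + d_in)

def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10^9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9

full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, rank=8)
print(f"trainable: {lora:,} vs {full:,} ({lora / full:.2%})")
# trainable: 65,536 vs 16,777,216 (0.39%)

for bits in (16, 8, 4):
    print(bits, "bit:", weight_memory_gb(7e9, bits), "GB")
```

The same two functions explain why QLoRA is attractive: the frozen base is stored at low precision while only the small adapter matrices are trained.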
Agents & Orchestration
-
What distinguishes an agent from a single LLM call? — Multi-step, tool-using, stateful behavior vs one shot.
-
What is function/tool calling? — Letting the model invoke external capabilities.
-
LangChain vs LlamaIndex vs LangGraph — when each? — General app framework vs data/RAG focus vs stateful graph control.
-
What is the ReAct pattern? — Interleaving reasoning and actions/tool calls.
-
Plan-and-execute vs reactive agents? — Upfront decomposition vs step-by-step reaction.
-
How do you manage agent memory? — Short-term context vs persistent long-term stores.
-
How do you prevent agents from deadlocking or looping? — Termination conditions, step limits, loop detection.
-
How do you constrain agent tool access? — Scoped permissions, allowlists, validation before execution.
-
What is MCP (Model Context Protocol)? — A standard for connecting models to tools and context.
-
How do you evaluate an agentic system? — Task success plus trajectory and tool-use quality.
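The answers on looping and tool access above (step limits, loop detection, allowlists) fit into one small control loop. This is a sketch, not a framework: `policy` stands in for the LLM and is a plain scripted function here, and the `lookup_price` tool and its data are hypothetical.

```python
def run_agent(policy, tools, max_steps=5):
    """Run a tool-using loop with an allowlist, a step limit, and repeated-call detection."""
    history = []        # list of ((tool, arg), result) pairs the policy can read
    seen_calls = set()  # identical (tool, arg) calls are treated as a loop
    for _ in range(max_steps):
        action = policy(history)
        if action["type"] == "finish":
            return {"status": "done", "answer": action["answer"], "steps": len(history)}
        call = (action["tool"], action["arg"])
        if action["tool"] not in tools:  # allowlist: only registered tools may run
            return {"status": "blocked_tool", "answer": None, "steps": len(history)}
        if call in seen_calls:
            return {"status": "loop_detected", "answer": None, "steps": len(history)}
        seen_calls.add(call)
        history.append((call, tools[action["tool"]](action["arg"])))
    return {"status": "step_limit", "answer": None, "steps": len(history)}

TOOLS = {"lookup_price": lambda sku: {"A1": 10}.get(sku)}  # hypothetical tool

def scripted_policy(history):
    """Stand-in for a model: call the tool once, then answer from its result."""
    if not history:
        return {"type": "tool", "tool": "lookup_price", "arg": "A1"}
    return {"type": "finish", "answer": f"Price is {history[-1][1]}"}

def stuck_policy(history):
    """Stand-in for a model that keeps repeating the same call."""
    return {"type": "tool", "tool": "lookup_price", "arg": "A1"}

print(run_agent(scripted_policy, TOOLS))
print(run_agent(stuck_policy, TOOLS))
```

Returning an explicit status for each exit path also helps evaluation, since trajectory quality can be scored from the recorded steps rather than only the final answer.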
Evaluation
-
What automated and human methods evaluate LLM outputs? — Benchmarks, LLM-as-judge, human review, regression suites.
-
How do you measure hallucination, coherence, and factual accuracy? — Faithfulness checks, grounding scores, human/judge ratings.
-
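Which metrics suit summarization vs QA vs generation? — ROUGE for summarization, exact match and token F1 for QA, BLEU/METEOR for translation-style generation, BERTScore for semantic similarity; all with task-aware caveats.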
-
What is LLM-as-a-judge and what are its biases? — Model scoring outputs; prone to position/verbosity bias.
-
What is an evaluation harness? — A framework for systematic, reproducible benchmarking.
-
Offline vs online evaluation? — Fixed benchmark before deploy vs sampled production traffic.
-
How do you scale evaluation for A/B tests? — Sampling, automated scoring, statistical comparison.
-
How do you build a regression suite that catches issues before prod? — Curated cases run on every change.
-
How do you define "good output" for a GenAI system? — Tie to business/compliance constraints, not vibes.
-
How do you detect policy violations and data leakage in outputs? — Classifiers, pattern checks, validation layers.
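A regression suite, as described above, can start as a list of curated cases plus a cheap automatic score. The sketch below uses exact match and a token-overlap F1 as the scoring functions; the threshold, the cases, and the `fake_model` function are all hypothetical placeholders for your own system and data.

```python
from collections import Counter

def normalize(text):
    """Lowercase and split on whitespace; real suites often strip punctuation too."""
    return text.lower().split()

def exact_match(prediction, reference):
    return normalize(prediction) == normalize(reference)

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall between prediction and reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def run_regression_suite(generate, cases, min_f1=0.8):
    """Run every curated case through `generate`; return the cases scoring below min_f1."""
    failures = []
    for case in cases:
        score = token_f1(generate(case["prompt"]), case["expected"])
        if score < min_f1:
            failures.append((case["prompt"], score))
    return failures

CASES = [
    {"prompt": "capital of France?", "expected": "Paris"},
    {"prompt": "2+2?", "expected": "4"},
]

def fake_model(prompt):
    """Stand-in for the system under test."""
    return {"capital of France?": "paris", "2+2?": "5"}[prompt]

print(run_regression_suite(fake_model, CASES))
# [('2+2?', 0.0)]
```

Running this on every prompt, model, or retrieval change and failing the build on a non-empty result is the simplest version of catching issues before prod.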
Security, Privacy & Compliance
-
How would you ensure HIPAA compliance in a healthcare GenAI system? — De-identification, access control, audit, output validation.
-
How and at what stages do you anonymize/de-identify data? — Before ingestion and before any model exposure.
-
How do you verify input data is anonymized? — Automated PII detection and validation gates.
-
How do you ensure outputs don't reintroduce sensitive information? — Output filtering, leakage detection, redaction.
-
How do you adapt an LLM without exposing financial data? — Masking/tokenization, data separation, exclusion from training.
-
How do you mask or tokenize confidential information? — Replace PII with tokens/placeholders before processing.
-
How do you separate training data from inference-time data? — Distinct pipelines and storage with strict boundaries.
-
How do you implement authorization for data extraction? — RBAC/ABAC enforced through the data layer.
-
How does access control integrate with RAG and prompts? — Filter retrievable docs by user entitlement before assembly.
-
How do you prevent unauthorized data entering the model? — Pre-prompt access checks and pipeline allowlists.
-
How do you audit and log access to sensitive data? — Immutable logs of who accessed what, when.
-
What guardrails do you put on inputs and outputs? — Validation, refusal logic, policy classifiers.
-
What are the main safety/ethical risks of deploying GenAI? — Bias, misuse, privacy, misinformation.
-
How do you handle PII in prompts and logs? — Redaction, retention limits, encryption (relevant under PIPEDA in Canada).
-
How do you handle data and model versioning for governance? — Track dataset, model, and config provenance.
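Two answers above, masking PII before processing and filtering retrievable documents by entitlement, can be illustrated briefly. This is a sketch only: the two regular expressions catch simple email and US SSN-style patterns and are nowhere near sufficient for HIPAA or PIPEDA on their own, and the `allowed_groups` field on each document is an assumed data model.

```python
import re

# Illustrative patterns only; production systems typically add a dedicated PII detector.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each matched pattern with a placeholder such as [EMAIL] before prompting or logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def filter_by_entitlement(docs, user_groups):
    """Keep only documents whose allowed_groups overlap the user's groups."""
    groups = set(user_groups)
    return [doc for doc in docs if groups & set(doc["allowed_groups"])]

print(redact("Contact jane.doe@example.com, SSN 123-45-6789."))
# Contact [EMAIL], SSN [SSN].

DOCS = [
    {"id": "hr-1", "allowed_groups": ["hr"]},
    {"id": "pub-1", "allowed_groups": ["hr", "staff"]},
]
print([doc["id"] for doc in filter_by_entitlement(DOCS, ["staff"])])
# ['pub-1']
```

The entitlement filter runs before prompt assembly, so a document the user cannot read never reaches the context window in the first place.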
System Design, Scale & Cost
-
How do you design a scalable, secure, fast GenAI application? — Layered architecture with caching, routing, guardrails.
-
How do you handle high concurrency and low-latency inference? — Batching, streaming, autoscaling, caching.
-
How do you scale embedding generation? — Batch jobs, async pipelines, precomputation.
-
What pipeline types do GenAI systems use? — Batch, streaming, and hybrid.
-
How do training, fine-tuning, and RAG pipelines differ? — Offline weight updates vs real-time retrieval/assembly.
-
How do you monitor system health and model degradation? — Drift detection, quality metrics, alerting.
-
What factors drive cost in a GenAI application? — Tokens, model choice, retrieval, infra, traffic.
-
How do you reduce latency without losing quality? — Smaller/distilled models, caching, and sending only the top-k retrieved chunks as context.
-
When do you use smaller/distilled models or hybrid architectures? — Cost/latency-sensitive paths with model routing.
-
How do you balance cost, accuracy, performance, and compliance? — Explicit trade-off decisions mapped to business priorities.
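The cost answers above are easier to defend with a back-of-the-envelope model. The sketch below assumes per-1,000-token pricing, made-up prices and traffic numbers, and that cache hits cost nothing; swap in your provider's actual rates and your own measurements.

```python
def daily_cost(requests, input_tokens, output_tokens,
               price_in_per_1k, price_out_per_1k, cache_hit_rate=0.0):
    """Estimate daily spend: per-request token cost times the requests that miss the cache."""
    per_request = (input_tokens / 1000 * price_in_per_1k
                   + output_tokens / 1000 * price_out_per_1k)
    billable_requests = requests * (1 - cache_hit_rate)  # assumption: cache hits are free
    return per_request * billable_requests

# Hypothetical workload: 100k requests/day, 1,500 prompt tokens, 300 completion tokens.
baseline = daily_cost(100_000, 1500, 300, price_in_per_1k=0.003, price_out_per_1k=0.015)
with_cache = daily_cost(100_000, 1500, 300, 0.003, 0.015, cache_hit_rate=0.2)
trimmed = daily_cost(100_000, 800, 300, 0.003, 0.015, cache_hit_rate=0.2)

print(f"baseline:        ${baseline:,.2f}")
print(f"with 20% cache:  ${with_cache:,.2f}")
print(f"trimmed context: ${trimmed:,.2f}")
```

Comparing the three lines shows where the levers are: caching cuts billable traffic, while trimming retrieved context cuts input tokens on every remaining request.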