100 Real GenAI Engineer Interview Questions

Posted on Wed 03 June 2026 in GenAI

Training & Adaptation Strategy

What approaches exist for training or adapting an LLM? — Pretraining, fine-tuning, instruction tuning, prompt engineering, RAG.
Base model vs instruction-tuned model? — Pure next-token predictor vs one aligned to follow instructions.
When would you choose fine-tuning over RAG? — Stable domain knowledge, style/format control, latency sensitivity.
When would you choose RAG over fine-tuning? — Fresh, changing, or large knowledge that shouldn't be baked into weights.
When is prompt engineering alone sufficient? — Low-stakes tasks where the underlying model already has the capability.
What decisions must be made before training an LLM? — Objective, data, budget, eval strategy, base model choice.
What trade-offs do you evaluate when picking a training strategy? — Cost, accuracy, latency, maintainability, compliance.
How do you ensure business requirements are met during adaptation? — Define success metrics tied to outcomes, not just loss.
Can prompt engineering be considered a form of training? — No. It involves no weight updates; it only conditions behavior at inference time.
When does prompt engineering stop being sufficient? — When accuracy, grounding, or consistency demands retrieval or tuning.

LLM Internals & Behavior

What is tokenization and why does it drive cost and latency? — Text becomes tokens; everything is billed and limited in tokens.
What is the context window? — All tokens the model sees at once: prompt, history, tools, retrieved docs.
What are embeddings and where do you use them? — Dense meaning vectors for search, clustering, retrieval.
What causes hallucinations? — Over-generalization, insufficient context, ungrounded generation.
How do temperature and top-p affect output? — Control randomness and the sampling distribution.
What is the difference between greedy decoding and sampling? — Deterministic top token vs probabilistic selection.
What are emergent abilities and why do they matter for model selection? — Capabilities appearing at scale that affect which model you pick.
Proprietary vs open-weight models — how do you decide? — Performance/turnkey safety vs control, cost, data sovereignty.
What is a distilled model and when do you use one? — A smaller model mimicking a larger one for cost/latency.
How do you select the right LLM for a given business use case? — Match capability, cost, latency, compliance to requirements.
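
The answers above on temperature, top-p, and greedy decoding vs sampling can be made concrete with a small sketch. It uses only the standard library and a made-up four-token vocabulary; the function names and the toy logits are illustrative assumptions, not any particular model's API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding: always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize the survivors


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample a token index after temperature scaling and nucleus (top-p) filtering."""
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]  # toy scores for a 4-token vocabulary (assumption)
print(greedy(logits))  # always index 0, the highest logit
print(sample(logits, temperature=0.7, top_p=0.9, rng=random.Random(0)))
```

Greedy decoding returns the same token every time, while `sample` can return different tokens across runs. Lowering `temperature` or `top_p` narrows the set of tokens that can be drawn, which is why both are described as controls on randomness.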

Prompt Engineering in Practice

How does prompt engineering control output behavior? — Constraints, role, examples, and format instructions.
How do you use prompting to reduce hallucinations? — Grounding instructions, "say I don't know," retrieved context.
How do you enforce structured outputs? — Schema/JSON constraints, function calling, validation.
How do you design prompts aligned with business logic? — Encode rules and constraints explicitly and test them.
How do you design prompts that respect compliance requirements? — Bake in PII/policy guardrails and refusal conditions.
What is chain-of-thought and when do you avoid it? — Eliciting reasoning; avoid when latency/cost or leakage matters.
What is few-shot vs zero-shot prompting? — In-context examples vs none.
What is self-consistency? — Sampling multiple reasoning paths and voting.
How do you defend against prompt injection? — Input separation, sanitization, instruction hierarchy, allowlists.
What are stop sequences and prompt templates used for? — Halting generation and standardizing reusable prompts.
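
For the structured-output question above, the validation step might look roughly like the sketch below. It assumes the model was asked to return a JSON object; the `parse_structured_output` helper and the two-field schema are hypothetical and stand in for whatever schema library or function-calling API a real system would use.

```python
import json


def parse_structured_output(raw, required):
    """Parse model text as JSON and check required keys and their types.

    `required` maps field name -> expected Python type (a hypothetical mini-schema).
    Raises ValueError so the caller can retry or fall back.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"wrong type for field: {key}")
    return data


SCHEMA = {"sentiment": str, "confidence": float}  # illustrative fields only

ok = parse_structured_output('{"sentiment": "positive", "confidence": 0.93}', SCHEMA)
print(ok["sentiment"])

try:
    # Chatty preamble around the JSON is a common failure mode; it is rejected here.
    parse_structured_output('Sure! Here is the JSON: {"sentiment": "positive"}', SCHEMA)
except ValueError as err:
    print("rejected:", err)
```

Raising an error instead of silently repairing the output keeps the retry or fallback decision with the caller, where business rules usually live.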

RAG

What is RAG and why use it? — Grounding generation in retrieved documents for freshness and accuracy.
How do you evaluate a RAG pipeline? — Assess retrieval and generation separately and jointly.
Beyond accuracy, what RAG metrics matter? — Faithfulness, relevance, retrieval quality.
How do you reduce hallucinations in a RAG system? — Better retriever, reranking, filtering, constrained decoding.
Why do dense retrievers like ColBERT or Contriever help? — Stronger semantic matching, especially fine-tuned on domain data.
What is reranking and where does it sit? — Second-stage scoring to weed out low-quality retrieved content.
How do you choose chunk size and overlap? — Balance context completeness against retrieval precision and truncation.
What is hybrid search? — Combining keyword (sparse) and vector (dense) retrieval.
What is the "lost in the middle" problem? — Models underuse information in the middle of long contexts.
How do you handle multi-hop questions? — Chained or iterative retrieval across documents.
What is metadata filtering and why use it? — Narrowing retrieval using structured attributes.
How do hard negatives improve retrieval? — Similar-but-wrong docs sharpen contrastive training.
What indexing structures power vector search? — ANN methods like HNSW for scalable similarity search.
How does RAG inference differ from a training pipeline? — Real-time retrieval and prompt assembly vs batch weight updates.
How do you keep a RAG knowledge base fresh and traceable? — Versioned ingestion, recency handling, source citation.
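
Hybrid search, mentioned above, needs some way to merge the keyword and vector result lists. One common option is reciprocal rank fusion (RRF); the sketch below assumes that choice, which the list itself does not prescribe, and the document IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; ties fall back to doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from a sparse/keyword index
vector_hits = ["doc_b", "doc_c", "doc_d"]   # e.g. from a dense/vector index
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc_b', 'doc_c', 'doc_a', 'doc_d']
```

Documents found by both retrievers rise to the top, and a reranker could then rescore this fused list as the second stage.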

Fine-tuning & Data

Full fine-tuning vs parameter-efficient fine-tuning? — All weights vs a small trainable subset.
What is LoRA / QLoRA? — Low-rank adapters, optionally on a quantized base model.
What is quantization and what does it cost you? — Lower precision for speed/memory at some accuracy risk.
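What is RLHF and what is DPO? — RLHF trains a reward model on human preferences and optimizes the policy against it with RL; DPO optimizes the policy directly on preference pairs, with no separate reward model.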
What is catastrophic forgetting and how do you avoid it? — Loss of prior skills; mitigate with mixed data, adapters.
What data do you need to fine-tune effectively? — Sufficient, clean, representative, correctly formatted examples.
How do you ensure sensitive data is excluded from fine-tuning sets? — Filtering, masking, provenance checks before training.
How do you verify what data actually went into a model? — Dataset versioning, lineage, and audit records.
What are the risks of synthetic training data? — Distribution drift, bias amplification, model collapse.
When do you use continual or domain-adaptive pretraining? — Large domain corpus that prompting/RAG can't cover.
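
To make the "small trainable subset" in the full vs parameter-efficient comparison concrete, the arithmetic below counts trainable parameters for one weight matrix under full fine-tuning and under a LoRA adapter. The layer size and rank are illustrative assumptions, not figures from a specific model.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole d_out x d_in weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Trainable parameters for a LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out = d_in = 4096  # illustrative layer size (assumption)
rank = 8             # illustrative adapter rank (assumption)

full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, f"{lora / full:.2%}")
# 16777216 65536 0.39%
```

The base weights stay frozen, so only the two small matrices are trained and stored per task. QLoRA applies the same adapters on top of a quantized base model.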

Agents & Orchestration

What distinguishes an agent from a single LLM call? — Multi-step, tool-using, stateful behavior vs one shot.
What is function/tool calling? — Letting the model invoke external capabilities.
LangChain vs LlamaIndex vs LangGraph — when each? — General app framework vs data/RAG focus vs stateful graph control.
What is the ReAct pattern? — Interleaving reasoning and actions/tool calls.
Plan-and-execute vs reactive agents? — Upfront decomposition vs step-by-step reaction.
How do you manage agent memory? — Short-term context vs persistent long-term stores.
How do you prevent agents from deadlocking or looping? — Termination conditions, step limits, loop detection.
How do you constrain agent tool access? — Scoped permissions, allowlists, validation before execution.
What is MCP (Model Context Protocol)? — A standard for connecting models to tools and context.
How do you evaluate an agentic system? — Task success plus trajectory and tool-use quality.
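
The termination conditions, step limits, and loop detection from the deadlock question can be sketched as a guard around the agent loop. Here `choose_action` is a hypothetical stand-in for the model deciding its next step, and tool execution is omitted.

```python
def run_agent(choose_action, max_steps=10, max_repeats=2):
    """Run an agent loop with a step limit and simple repeated-action detection.

    `choose_action(history)` is a hypothetical callable standing in for the
    model; it returns a (tool_name, argument) tuple, or ("finish", answer).
    """
    history = []
    for _ in range(max_steps):
        action = choose_action(history)
        if action[0] == "finish":
            return {"status": "done", "answer": action[1], "steps": len(history)}
        # Loop detection: stop if the same action has already run max_repeats times.
        if history.count(action) >= max_repeats:
            return {"status": "loop_detected", "answer": None, "steps": len(history)}
        history.append(action)  # a real agent would execute the tool here
    return {"status": "step_limit", "answer": None, "steps": len(history)}


def stuck_policy(history):
    """Toy policy that keeps issuing the same search forever."""
    return ("search", "quarterly revenue")


def good_policy(history):
    """Toy policy that searches once, then finishes."""
    return ("finish", "42") if history else ("search", "quarterly revenue")


print(run_agent(stuck_policy)["status"])  # loop_detected
print(run_agent(good_policy)["status"])   # done
```

Exact-match repetition is the simplest possible loop signal. A production system would likely also watch for near-duplicate actions and enforce time or token budgets.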

Evaluation

What automated and human methods evaluate LLM outputs? — Benchmarks, LLM-as-judge, human review, regression suites.
How do you measure hallucination, coherence, and factual accuracy? — Faithfulness checks, grounding scores, human/judge ratings.
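Which metrics suit summarization vs QA vs generation? — ROUGE and BERTScore for summarization, exact match and token-level F1 for QA, BLEU and METEOR for translation-style generation; all are proxies that need task-aware caveats.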
What is LLM-as-a-judge and what are its biases? — Model scoring outputs; prone to position/verbosity bias.
What is an evaluation harness? — A framework for systematic, reproducible benchmarking.
Offline vs online evaluation? — Fixed benchmark before deploy vs sampled production traffic.
How do you scale evaluation for A/B tests? — Sampling, automated scoring, statistical comparison.
How do you build a regression suite that catches issues before prod? — Curated cases run on every change.
How do you define "good output" for a GenAI system? — Tie to business/compliance constraints, not vibes.
How do you detect policy violations and data leakage in outputs? — Classifiers, pattern checks, validation layers.
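
The position bias noted for LLM-as-a-judge is often countered by asking the judge twice with the answers swapped. The sketch below assumes a pairwise `judge(first, second)` callable that returns "first" or "second"; that interface is hypothetical, and the two toy judges only simulate biased behavior.

```python
def debiased_preference(judge, answer_a, answer_b):
    """Ask a pairwise judge twice with the order swapped to counter position bias.

    `judge(first, second)` is a hypothetical callable (for example a wrapper
    around an LLM call) that returns "first" or "second".
    A verdict only counts if it survives the swap; otherwise it is a tie.
    """
    forward = judge(answer_a, answer_b)   # A shown first
    backward = judge(answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def position_biased_judge(first, second):
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def length_judge(first, second):
    """Toy judge that prefers the longer answer (a stand-in for verbosity bias)."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_preference(position_biased_judge, "short", "a longer answer"))  # tie
print(debiased_preference(length_judge, "short", "a longer answer"))           # B
```

Swapping neutralizes the purely positional judge, but the length-preferring judge still wins consistently. Verbosity bias needs a different control, such as length-matched comparisons or human spot checks.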

Security, Privacy & Compliance

How would you ensure HIPAA compliance in a healthcare GenAI system? — De-identification, access control, audit, output validation.
How and at what stages do you anonymize/de-identify data? — Before ingestion and before any model exposure.
How do you verify input data is anonymized? — Automated PII detection and validation gates.
How do you ensure outputs don't reintroduce sensitive information? — Output filtering, leakage detection, redaction.
How do you adapt an LLM without exposing financial data? — Masking/tokenization, data separation, exclusion from training.
How do you mask or tokenize confidential information? — Replace PII with tokens/placeholders before processing.
How do you separate training data from inference-time data? — Distinct pipelines and storage with strict boundaries.
How do you implement authorization for data extraction? — RBAC/ABAC enforced through the data layer.
How does access control integrate with RAG and prompts? — Filter retrievable docs by user entitlement before assembly.
How do you prevent unauthorized data entering the model? — Pre-prompt access checks and pipeline allowlists.
How do you audit and log access to sensitive data? — Immutable logs of who accessed what, when.
What guardrails do you put on inputs and outputs? — Validation, refusal logic, policy classifiers.
What are the main safety/ethical risks of deploying GenAI? — Bias, misuse, privacy, misinformation.
How do you handle PII in prompts and logs? — Redaction, retention limits, encryption (relevant under PIPEDA in Canada).
How do you handle data and model versioning for governance? — Track dataset, model, and config provenance.
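
Replacing PII with placeholders before processing, as described in the masking question, might look like the sketch below. The two regular expressions are deliberately simple illustrations and would miss many real-world formats; the `mask_pii` helper and placeholder naming are assumptions, not a compliance-grade tool.

```python
import re

# Deliberately simple patterns for illustration; real PII detection needs far more.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def mask_pii(text):
    """Replace matches with numbered placeholders and return the lookup table.

    The mapping lets an authorized service restore values later; it should be
    stored separately from anything sent to the model.
    """
    mapping = {}
    for label, pattern in PATTERNS.items():
        def substitute(match, label=label):
            placeholder = f"<{label}_{len(mapping) + 1}>"
            mapping[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(substitute, text)
    return text, mapping


masked, lookup = mask_pii("Contact jane.doe@example.com or call 555-123-4567.")
print(masked)
# Contact <EMAIL_1> or call <PHONE_2>.
```

Running the same detector over model outputs and logs gives a basic check that sensitive values are not reintroduced downstream.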

System Design, Scale & Cost

How do you design a scalable, secure, fast GenAI application? — Layered architecture with caching, routing, guardrails.
How do you handle high concurrency and low-latency inference? — Batching, streaming, autoscaling, caching.
How do you scale embedding generation? — Batch jobs, async pipelines, precomputation.
What pipeline types do GenAI systems use? — Batch, streaming, and hybrid.
How do training, fine-tuning, and RAG pipelines differ? — Offline weight updates vs real-time retrieval/assembly.
How do you monitor system health and model degradation? — Drift detection, quality metrics, alerting.
What factors drive cost in a GenAI application? — Tokens, model choice, retrieval, infra, traffic.
How do you reduce latency without losing quality? — Smaller/distilled models, caching, and passing only the top-k retrieved chunks as context.
When do you use smaller/distilled models or hybrid architectures? — Cost/latency-sensitive paths with model routing.
How do you balance cost, accuracy, performance, and compliance? — Explicit trade-off decisions mapped to business priorities.
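
Caching appears in several answers above as a lever for both cost and latency. A minimal exact-match response cache could be sketched as follows; `fake_model` and the model name are hypothetical stand-ins for a real inference call.

```python
import hashlib
import json


class ResponseCache:
    """Exact-match cache keyed on model, prompt, and sampling parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model, prompt, params):
        # Canonical JSON so the same request always hashes to the same key.
        payload = json.dumps([model, prompt, params], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, params, call_model):
        """Return a cached response, or invoke `call_model` and store its result."""
        key = self._key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        response = call_model(model, prompt, params)
        self._store[key] = response
        return response


def fake_model(model, prompt, params):
    """Hypothetical stand-in for a real inference call."""
    return f"[{model}] echo: {prompt}"


cache = ResponseCache()
for _ in range(3):
    cache.get_or_call("small-model", "What is RAG?", {"temperature": 0}, fake_model)
print(cache.hits, cache.misses)
# 2 1
```

Exact-match caching is most useful for deterministic settings and repeated prompts. Including the sampling parameters in the key avoids serving a response generated under different settings.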