100 Real GenAI Engineer Interview Questions

Posted on Wed 03 June 2026 in GenAI

Training & Adaptation Strategy

  1. What approaches exist for training or adapting an LLM? — Pretraining, fine-tuning, instruction tuning, prompt engineering, RAG.

  2. Base model vs instruction-tuned model? — Pure next-token predictor vs one aligned to follow instructions.

  3. When would you choose fine-tuning over RAG? — Stable domain knowledge, style/format control, latency sensitivity.

  4. When would you choose RAG over fine-tuning? — Fresh, changing, or large knowledge that shouldn't be baked into weights.

  5. When is prompt engineering alone sufficient? — Low-stakes tasks where the chosen model already has the capability.

  6. What decisions must be made before training an LLM? — Objective, data, budget, eval strategy, base model choice.

  7. What trade-offs do you evaluate when picking a training strategy? — Cost, accuracy, latency, maintainability, compliance.

  8. How do you ensure business requirements are met during adaptation? — Define success metrics tied to outcomes, not just loss.

  9. Can prompt engineering be considered a form of training? — No weight updates; it conditions behavior at inference.

  10. When does prompt engineering stop being sufficient? — When accuracy, grounding, or consistency demands retrieval or tuning.

LLM Internals & Behavior

  1. What is tokenization and why does it drive cost and latency? — Text becomes tokens; everything is billed and limited in tokens.

  2. What is the context window? — All tokens the model sees at once: prompt, history, tools, retrieved docs.

  3. What are embeddings and where do you use them? — Dense meaning vectors for search, clustering, retrieval.

  4. What causes hallucinations? — Over-generalization, insufficient context, ungrounded generation.

  5. How do temperature and top-p affect output? — Control randomness and the sampling distribution.

  6. What is the difference between greedy decoding and sampling? — Deterministic top token vs probabilistic selection.

  7. What are emergent abilities and why do they matter for model selection? — Capabilities appearing at scale that affect which model you pick.

  8. Proprietary vs open-weight models — how do you decide? — Performance/turnkey safety vs control, cost, data sovereignty.

  9. What is a distilled model and when do you use one? — A smaller student model trained to reproduce a larger teacher's outputs; use it when cost or latency matters more than peak quality.

  10. How do you select the right LLM for a given business use case? — Match capability, cost, latency, compliance to requirements.
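
The answers on temperature, top-p, and greedy decoding vs sampling can be made concrete with a small sketch. It assumes a toy list of logits rather than a real model, and the function names (`softmax_with_temperature`, `top_p_filter`, `greedy`, `sample`) are illustrative, not taken from any library.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the kept tokens


def greedy(logits):
    """Deterministic decoding: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Probabilistic decoding: temperature scaling, then nucleus (top-p) sampling."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

With a very small `top_p` the nucleus shrinks to the single most likely token, so `sample` behaves like `greedy`; raising the temperature flattens the distribution and makes less likely tokens more probable.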

Prompt Engineering in Practice

  1. How does prompt engineering control output behavior? — Constraints, role, examples, and format instructions.

  2. How do you use prompting to reduce hallucinations? — Grounding instructions, "say I don't know," retrieved context.

  3. How do you enforce structured outputs? — Schema/JSON constraints, function calling, validation.

  4. How do you design prompts aligned with business logic? — Encode rules and constraints explicitly and test them.

  5. How do you design prompts that respect compliance requirements? — Bake in PII/policy guardrails and refusal conditions.

  6. What is chain-of-thought and when do you avoid it? — Eliciting reasoning; avoid when latency/cost or leakage matters.

  7. What is few-shot vs zero-shot prompting? — In-context examples vs none.

  8. What is self-consistency? — Sampling multiple reasoning paths and voting.

  9. How do you defend against prompt injection? — Input separation, sanitization, instruction hierarchy, allowlists.

  10. What are stop sequences and prompt templates used for? — Halting generation and standardizing reusable prompts.
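
For the structured-output question above, the validation step might look like the following sketch. It assumes the model was asked to return a JSON object; `TICKET_SCHEMA` and `parse_structured_output` are hypothetical names, and the type check is deliberately strict and minimal rather than a full JSON Schema validator.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
TICKET_SCHEMA = {"category": str, "priority": int, "summary": str}


def parse_structured_output(raw_text, schema):
    """Return (data, errors); data is None whenever validation fails."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif type(data[field]) is not expected:  # strict: True is not accepted as int
            errors.append(f"wrong type for {field}")
    for field in data:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return (data, []) if not errors else (None, errors)
```

A caller would typically retry the model call or fall back to a safe default when `errors` is non-empty, instead of passing unvalidated output downstream.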

RAG

  1. What is RAG and why use it? — Grounding generation in retrieved documents for freshness and accuracy.

  2. How do you evaluate a RAG pipeline? — Assess retrieval and generation separately and jointly.

  3. Beyond accuracy, what RAG metrics matter? — Faithfulness, relevance, retrieval quality.

  4. How do you reduce hallucinations in a RAG system? — Better retriever, reranking, filtering, constrained decoding.

  5. Why do dense retrievers like ColBERT or Contriever help? — Stronger semantic matching, especially fine-tuned on domain data.

  6. What is reranking and where does it sit? — Second-stage scoring between retrieval and generation that reorders candidates and weeds out low-quality content.

  7. How do you choose chunk size and overlap? — Balance context completeness against retrieval precision and truncation.

  8. What is hybrid search? — Combining keyword (sparse) and vector (dense) retrieval.

  9. What is the "lost in the middle" problem? — Models underuse information in the middle of long contexts.

  10. How do you handle multi-hop questions? — Chained or iterative retrieval across documents.

  11. What is metadata filtering and why use it? — Narrowing retrieval using structured attributes.

  12. How do hard negatives improve retrieval? — Similar-but-wrong docs sharpen contrastive training.

  13. What indexing structures power vector search? — ANN methods like HNSW for scalable similarity search.

  14. How does RAG inference differ from a training pipeline? — Real-time retrieval and prompt assembly vs batch weight updates.

  15. How do you keep a RAG knowledge base fresh and traceable? — Versioned ingestion, recency handling, source citation.
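
Two of the answers above, chunking with overlap and hybrid search, are easy to sketch without any retrieval library. The code below is a minimal illustration under stated assumptions: chunks are measured in whitespace-separated words rather than model tokens, and the sparse and dense result lists are merged with reciprocal rank fusion, which is one common fusion choice and not the only way to combine them. The function names and the default `k=60` are illustrative.

```python
def chunk_words(text, chunk_size, overlap):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

A usage example with hypothetical document ids: fusing a keyword ranking `["d1", "d2", "d3"]` with a vector ranking `["d3", "d1", "d4"]` places `d1` and `d3` first, because both retrievers returned them.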

Fine-tuning & Data

  1. Full fine-tuning vs parameter-efficient fine-tuning? — All weights vs a small trainable subset.

  2. What is LoRA / QLoRA? — Low-rank adapters, optionally on a quantized base model.

  3. What is quantization and what does it cost you? — Lower precision for speed/memory at some accuracy risk.

  4. What is RLHF and what is DPO? — Preference alignment via a learned reward model plus reinforcement learning vs optimizing directly on preference pairs with no separate reward model.

  5. What is catastrophic forgetting and how do you avoid it? — Loss of prior skills; mitigate with mixed data, adapters.

  6. What data do you need to fine-tune effectively? — Sufficient, clean, representative, correctly formatted examples.

  7. How do you ensure sensitive data is excluded from fine-tuning sets? — Filtering, masking, provenance checks before training.

  8. How do you verify what data actually went into a model? — Dataset versioning, lineage, and audit records.

  9. What are the risks of synthetic training data? — Distribution drift, bias amplification, model collapse.

  10. When do you use continual or domain-adaptive pretraining? — Large domain corpus that prompting/RAG can't cover.
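
The "small trainable subset" behind LoRA can be illustrated with plain arithmetic. The sketch below assumes a single weight matrix of shape `d_out x d_in` and the usual low-rank update `W + (alpha / rank) * B @ A`; the helper names are hypothetical and pure-Python lists stand in for tensors.

```python
def full_params(d_in, d_out):
    """Weights updated by full fine-tuning of one d_out x d_in matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Weights trained by LoRA: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


def matmul(a, b):
    """Naive matrix product for small nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(w, b, a, alpha, rank):
    """Fold the adapter into the frozen weights: W + (alpha / rank) * B @ A."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]
```

For a 4096 x 4096 matrix at rank 8, `lora_params` gives 65,536 trainable weights against 16,777,216 for full fine-tuning, which is 1/256 of the matrix.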

Agents & Orchestration

  1. What distinguishes an agent from a single LLM call? — Multi-step, tool-using, stateful behavior vs one shot.

  2. What is function/tool calling? — Letting the model invoke external capabilities.

  3. LangChain vs LlamaIndex vs LangGraph — when each? — General app framework vs data/RAG focus vs stateful graph control.

  4. What is the ReAct pattern? — Interleaving reasoning and actions/tool calls.

  5. Plan-and-execute vs reactive agents? — Upfront decomposition vs step-by-step reaction.

  6. How do you manage agent memory? — Short-term context vs persistent long-term stores.

  7. How do you prevent agents from deadlocking or looping? — Termination conditions, step limits, loop detection.

  8. How do you constrain agent tool access? — Scoped permissions, allowlists, validation before execution.

  9. What is MCP (Model Context Protocol)? — A standard for connecting models to tools and context.

  10. How do you evaluate an agentic system? — Task success plus trajectory and tool-use quality.
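
The answers on loop prevention and tool constraints can be combined into one small control loop. This is a sketch under stated assumptions: `policy` is a hypothetical callable standing in for the LLM, returning either a tool request or a final answer as a dict, and `tools` is an allowlist mapping tool names to plain Python functions.

```python
def run_agent(policy, tools, task, max_steps=5):
    """Run a tool-using loop with an allowlist, a step limit, and loop detection."""
    history = []  # (tool name, input, observation) tuples fed back to the policy
    seen = set()  # (tool name, input) pairs already executed
    for step in range(max_steps):
        action = policy(task, history)
        if action["type"] == "final":
            return {"status": "done", "answer": action["answer"], "steps": step, "history": history}

        name, arg = action["tool"], action["input"]
        if name not in tools:
            # Scoped access: tools outside the allowlist are never executed.
            history.append((name, arg, "error: tool not allowed"))
            continue
        if (name, arg) in seen:
            # Same call with the same input again: treat it as a loop.
            return {"status": "loop_detected", "answer": None, "steps": step, "history": history}

        seen.add((name, arg))
        history.append((name, arg, tools[name](arg)))

    # The step limit is the final termination condition.
    return {"status": "step_limit", "answer": None, "steps": max_steps, "history": history}
```

Returning the full `history` keeps the trajectory available for evaluation, so tool-use quality can be inspected alongside task success.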

Evaluation

  1. What automated and human methods evaluate LLM outputs? — Benchmarks, LLM-as-judge, human review, regression suites.

  2. How do you measure hallucination, coherence, and factual accuracy? — Faithfulness checks, grounding scores, human/judge ratings.

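  3. Which metrics suit summarization vs QA vs generation? — ROUGE for summarization, exact match or token F1 for QA, BLEU/METEOR for translation-style generation, BERTScore for semantic similarity; all with task-aware caveats.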

  4. What is LLM-as-a-judge and what are its biases? — Model scoring outputs; prone to position/verbosity bias.

  5. What is an evaluation harness? — A framework for systematic, reproducible benchmarking.

  6. Offline vs online evaluation? — Fixed benchmark before deploy vs sampled production traffic.

  7. How do you scale evaluation for A/B tests? — Sampling, automated scoring, statistical comparison.

  8. How do you build a regression suite that catches issues before prod? — Curated cases run on every change.

  9. How do you define "good output" for a GenAI system? — Tie to business/compliance constraints, not vibes.

  10. How do you detect policy violations and data leakage in outputs? — Classifiers, pattern checks, validation layers.
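
As a concrete instance of the overlap metrics mentioned above, here is a simplified ROUGE-1 sketch. It assumes lowercase whitespace tokenization with no stemming or stopword handling, so its scores will not match reference implementations exactly; the function name `rouge1` is illustrative.

```python
from collections import Counter


def rouge1(candidate, reference):
    """Unigram overlap between candidate and reference, with clipped counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-token minimum of the two counts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Scores like this reward shared wording, not factual correctness, which is why the list pairs them with faithfulness checks and human or judge ratings.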

Security, Privacy & Compliance

  1. How would you ensure HIPAA compliance in a healthcare GenAI system? — De-identification, access control, audit, output validation.

  2. How and at what stages do you anonymize/de-identify data? — Before ingestion and before any model exposure.

  3. How do you verify input data is anonymized? — Automated PII detection and validation gates.

  4. How do you ensure outputs don't reintroduce sensitive information? — Output filtering, leakage detection, redaction.

  5. How do you adapt an LLM without exposing financial data? — Masking/tokenization, data separation, exclusion from training.

  6. How do you mask or tokenize confidential information? — Replace PII with tokens/placeholders before processing.

  7. How do you separate training data from inference-time data? — Distinct pipelines and storage with strict boundaries.

  8. How do you implement authorization for data extraction? — RBAC/ABAC enforced through the data layer.

  9. How does access control integrate with RAG and prompts? — Filter retrievable docs by user entitlement before assembly.

  10. How do you prevent unauthorized data entering the model? — Pre-prompt access checks and pipeline allowlists.

  11. How do you audit and log access to sensitive data? — Immutable logs of who accessed what, when.

  12. What guardrails do you put on inputs and outputs? — Validation, refusal logic, policy classifiers.

  13. What are the main safety/ethical risks of deploying GenAI? — Bias, misuse, privacy, misinformation.

  14. How do you handle PII in prompts and logs? — Redaction, retention limits, encryption (relevant under PIPEDA in Canada).

  15. How do you handle data and model versioning for governance? — Track dataset, model, and config provenance.
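
The masking answer above (replace PII with placeholders before processing) can be sketched with the standard `re` module. The two patterns are illustrative assumptions that cover only simple email addresses and one phone format; a real system would rely on a dedicated PII detector. `mask_pii` and `unmask` are hypothetical helper names.

```python
import re

# Illustrative patterns only; they are far from exhaustive.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_pii(text):
    """Replace detected values with placeholders; return (masked_text, mapping)."""
    mapping = {}   # placeholder -> original value, kept outside the prompt
    counters = {}  # per-label running index

    def make_replacer(label):
        def replace(match):
            value = match.group(0)
            for token, original in mapping.items():
                if original == value:
                    return token  # reuse the placeholder for a repeated value
            counters[label] = counters.get(label, 0) + 1
            token = f"<{label}_{counters[label]}>"
            mapping[token] = value
            return token
        return replace

    text = EMAIL.sub(make_replacer("EMAIL"), text)
    text = PHONE.sub(make_replacer("PHONE"), text)
    return text, mapping


def unmask(text, mapping):
    """Restore original values, for example in a response shown to an authorized user."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

Only the masked text would be sent to the model or written to logs; the mapping stays in a store governed by the same access controls as the source data.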

System Design, Scale & Cost

  1. How do you design a scalable, secure, fast GenAI application? — Layered architecture with caching, routing, guardrails.

  2. How do you handle high concurrency and low-latency inference? — Batching, streaming, autoscaling, caching.

  3. How do you scale embedding generation? — Batch jobs, async pipelines, precomputation.

  4. What pipeline types do GenAI systems use? — Batch, streaming, and hybrid.

  5. How do training, fine-tuning, and RAG pipelines differ? — Offline weight updates vs real-time retrieval/assembly.

  6. How do you monitor system health and model degradation? — Drift detection, quality metrics, alerting.

  7. What factors drive cost in a GenAI application? — Tokens, model choice, retrieval, infra, traffic.

  8. How do you reduce latency without losing quality? — Smaller/distilled models, caching, and passing only the top-k retrieved chunks as context.

  9. When do you use smaller/distilled models or hybrid architectures? — Cost/latency-sensitive paths with model routing.

  10. How do you balance cost, accuracy, performance, and compliance? — Explicit trade-off decisions mapped to business priorities.
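
Caching comes up in several answers above as a lever for both latency and cost. A minimal exact-match response cache might look like the sketch below. It assumes identical prompts may safely share a response, normalizes only whitespace, and expires entries after a TTL; `ResponseCache` and the `call_model` callable are hypothetical stand-ins for a real client.

```python
import hashlib
import time


class ResponseCache:
    """Exact-match cache keyed by model name and whitespace-normalized prompt."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store = {}     # key -> (stored_at, response)
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        normalized = " ".join(prompt.split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def get_or_call(self, model, prompt, call_model):
        """Return a fresh cached response, or call the model and store the result."""
        key = self._key(model, prompt)
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = call_model(prompt)
        self._store[key] = (now, response)
        return response
```

Tracking `hits` and `misses` gives a direct measure of how many model calls, and therefore tokens, the cache avoids; responses that depend on user entitlements or fresh data would need those factors in the key or should bypass the cache.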