50 Basic GenAI Engineer Interview Questions

Posted on Tue 02 June 2026 in GenAI

A starter question bank for screening entry-level GenAI engineers. The questions are grouped by theme and run from fundamentals through production concerns.

Fundamentals

What is generative AI vs discriminative AI? — Generative models learn the data distribution so they can produce new samples; discriminative models learn decision boundaries to classify or predict.
What is a large language model (LLM)? — A neural network trained on large text corpora to predict tokens, capable of broad language tasks.
Transformer architecture — Attention-based sequence modeling that processes tokens in parallel instead of relying on recurrence.
Attention mechanism — How a model weighs the relevance of different tokens when producing a representation.
Self-attention vs cross-attention — Self-attention relates tokens within one sequence; cross-attention relates one sequence to another.
Tokens and tokenization — How raw text is split into model-readable units and why the scheme matters.
Context window — The maximum number of tokens a model can attend to at once, counting both the prompt and the generated output.
Embeddings — Dense vector representations of meaning, used for search, clustering, and retrieval.
Encoder vs decoder vs encoder-decoder — Architectural families and the task types each suits.
Pretraining vs fine-tuning — Broad foundational training vs targeted adaptation to a task or domain.
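
The attention, self-attention, and cross-attention items above can be made concrete with a small sketch. This is a minimal, pure-Python version of scaled dot-product attention, assuming toy 2-dimensional vectors and skipping the learned query/key/value projections a real Transformer applies. The names `attention`, `tokens`, and `other` are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how strongly
    query i attends to key j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[dim] for wj, v in zip(w, values))
               for dim in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Hypothetical vectors for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: queries, keys, and values all come from one sequence.
self_out, self_w = attention(tokens, tokens, tokens)

# Cross-attention: queries from one sequence, keys/values from another.
other = [[0.5, 0.5], [2.0, 0.0]]
cross_out, cross_w = attention(tokens, other, other)
```

A candidate who can explain why each row of `self_w` sums to 1, and why `cross_w` has one column per token in `other`, has likely understood the mechanism rather than memorized the definition.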

Prompting and Generation

Prompt engineering — Designing inputs to reliably steer model behavior.
Zero-shot vs few-shot prompting — No examples vs in-context examples to guide the output.
Chain-of-thought prompting — Eliciting intermediate reasoning steps to improve answers.
Temperature and top-p — Controls over randomness and the sampling distribution of generated tokens.
Greedy decoding vs sampling — Always taking the top token vs probabilistic selection for diversity.
System prompt — Instructions that set persistent behavior and role for the model.
Hallucinations — Why models invent plausible-but-false content and how to mitigate it.
Max tokens parameter — Caps the number of generated tokens, which bounds cost but can truncate a response mid-answer.
In-context learning — Learning a task purely from examples in the prompt, no weight updates.
Prompt injection — Adversarial inputs that hijack instructions and defenses against them.
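
The temperature, top-p, and greedy decoding items above can be sketched in a few lines. The following is a simplified illustration that operates on a hypothetical list of logits for a four-token vocabulary; production inference stacks implement the same ideas on tensors, and the function names here are made up for the example.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy() instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Greedy decoding: always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize. Returns {token_index: probability}."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token index after temperature scaling and top-p filtering."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    return rng.choices(indices, weights=[nucleus[i] for i in indices], k=1)[0]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

cold = softmax_with_temperature(logits, temperature=0.5)  # more peaked
warm = softmax_with_temperature(logits, temperature=2.0)  # flatter
token = sample(logits, temperature=0.8, top_p=0.9, rng=random.Random(0))
```

A useful follow-up is to ask what happens as temperature approaches zero: the distribution collapses onto the top token, so sampling behaves like `greedy`.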

Retrieval-Augmented Generation (RAG)

What is RAG and why use it? — Grounding generation in retrieved documents to reduce hallucination and add fresh knowledge.
Vector database — Storage and similarity search over embeddings.
Semantic vs keyword search — Meaning-based retrieval vs literal term matching.
Chunking and chunk size — Splitting documents for retrieval and the tradeoffs of granularity.
Cosine similarity — The cosine of the angle between two vectors, which compares direction rather than magnitude and is used to rank retrieved items.
Embedding model in RAG — Converts queries and documents into the shared vector space.
Re-ranking — A second-stage scoring pass to refine retrieved candidates.
Evaluating a RAG system — Measuring retrieval relevance and answer faithfulness.
Hybrid search — Combining sparse (keyword) and dense (vector) retrieval.
Handling context overflow — Strategies when retrieved content exceeds the window.
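
The chunking and cosine similarity items above are easy to probe with a whiteboard-sized example. The sketch below assumes word-based chunks and hand-written 3-dimensional vectors; a real pipeline would obtain embeddings from an embedding model and store them in a vector database. The document IDs and vectors are hypothetical.

```python
import math

def chunk_words(text, chunk_size, overlap):
    """Split text into word-based chunks; neighbors share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and below chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical embeddings; real ones come from an embedding model.
doc_vecs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "privacy": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
chunks = chunk_words("one two three four five six seven",
                     chunk_size=4, overlap=2)
```

The overlap parameter is a good discussion point: larger overlap reduces the chance of splitting a fact across chunk boundaries, at the price of more chunks to embed and store.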

Fine-Tuning and Training

Full vs parameter-efficient fine-tuning — Updating all weights vs training only a small set of existing or added parameters to cut compute and memory cost.
LoRA — Low-rank adapters that fine-tune efficiently with few trainable parameters.
Quantization — Reducing numerical precision to shrink models and speed inference.
RLHF — Aligning models to human preferences by optimizing against a reward model trained on human comparisons of outputs.
Instruction tuning — Training on instruction-response pairs to improve task following.
Catastrophic forgetting — Loss of prior capabilities after further training.
SFT vs RLHF — Supervised fine-tuning vs preference-based reinforcement.
Fine-tune vs RAG vs prompting — When each approach is the right tool.
Data needed to fine-tune — Quantity, quality, and format requirements.
Overfitting — Memorizing training data and how to detect it.
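
The LoRA and parameter-efficient fine-tuning items above come down to simple arithmetic. The sketch below assumes the common formulation in which a frozen weight matrix W (d_out x d_in) is augmented by a low-rank product B A scaled by alpha / r. The matrices and sizes are toy values chosen for illustration, and the helper names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(x, W, A, B, alpha):
    """Compute (W + (alpha / r) * B A) x without forming the full update.

    W: frozen weights, shape d_out x d_in
    A: trainable, shape r x d_in
    B: trainable, shape d_out x r
    """
    r = len(A)
    base = matvec(W, x)                 # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))  # trainable path through rank r
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

def trainable_params(d_out, d_in, r):
    """Trainable parameter counts for one weight matrix."""
    return {"full": d_out * d_in, "lora": r * (d_in + d_out)}

# Toy example: d_in = 4, d_out = 3, rank r = 1.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
A = [[0.5, 0.5, 0.5, 0.5]]
B_zero = [[0.0], [0.0], [0.0]]   # B starts at zero, so the update is zero
B_trained = [[1.0], [0.0], [-1.0]]  # made-up values standing in for training
x = [1.0, 2.0, 3.0, 4.0]

before = lora_forward(x, W, A, B_zero, alpha=2.0)    # equals W x
after = lora_forward(x, W, A, B_trained, alpha=2.0)  # shifted by the adapter

# For a 4096 x 4096 matrix with r = 8: 65,536 trainable vs 16,777,216.
counts = trainable_params(4096, 4096, 8)
```

Initializing B to zero means the adapted model starts out identical to the pretrained one, which is why `before` matches the frozen output exactly.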

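Evaluation and Production

Evaluating LLM output quality — Approaches to judging open-ended generation.
Generative text metrics — BLEU and ROUGE (n-gram overlap with reference text), perplexity (how well a model predicts held-out text), and their limits.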
LLM-as-a-judge — Using a model to score outputs and its biases.
Reducing latency — Techniques like streaming, caching, and smaller models.
Controlling inference cost — Token budgeting, model selection, and caching.
Model distillation — Training a smaller model to mimic a larger one.
Guardrails — Constraints on inputs and outputs for safety and compliance.
Monitoring in production — Tracking quality, drift, cost, and failures.
Safety and ethical risks — Bias, misuse, privacy, and misinformation concerns.
Agent vs simple LLM call — Multi-step tool-using systems vs single-shot generation.
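
The generative text metrics item above can be illustrated with two small functions. This is a simplified sketch: `perplexity` assumes you already have the probability the model assigned to each actual token, and `unigram_overlap` is a ROUGE-1 style score on whitespace tokens, not a replacement for a full ROUGE implementation. The example sentences are made up.

```python
import math
from collections import Counter

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def unigram_overlap(candidate, reference):
    """ROUGE-1 style precision, recall, and F1 with clipped word counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])

# Lexical overlap rewards shared words...
exact = unigram_overlap("the cat sat", "the cat sat on the mat")

# ...but scores a reasonable paraphrase poorly, which is one of the
# limits the question is probing for.
paraphrase = unigram_overlap("the feline rested", "the cat sat")
```

The gap between `exact` and `paraphrase` is a natural bridge to the LLM-as-a-judge item, since judged evaluation is often used where overlap metrics miss meaning.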