RAG Pipeline
Retrieval-Augmented Generation (RAG) is an architecture that grounds LLM responses in specific retrieved documents, reducing hallucination on facts outside the model's training data.
The Problem with Base LLMs
Large language models have two fundamental limitations for production fact-retrieval tasks. First, a knowledge cutoff: a model trained through a certain date cannot answer questions about events after that date. Second, hallucination on specific facts: models generate plausible text that may not correspond to actual facts, particularly on specific figures, citations, or niche domain knowledge not well-represented in training data. Asking GPT-4 about the specific clause in a private contract or the current on-call engineer for a system will produce confident-sounding fabrications.
RAG addresses both by retrieving relevant documents at inference time and including them in the prompt as grounding context. The model is not asked to recall; it is asked to read and synthesize.
The Three Stages
Indexing: documents are preprocessed offline. Each document is split into chunks (typically 256-1024 tokens). Each chunk is converted into a dense vector embedding using an embedding model. The vectors are stored in a vector database with the original text as metadata. This stage runs once and is updated incrementally as documents change.
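The indexing stage can be sketched in a few lines. This is a minimal illustration, not a production indexer: embed() is a stand-in for a real embedding model call, and the "database" is a plain list rather than a vector store.

```python
def chunk_text(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping fixed-size chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(tokens):
            break
    return chunks

def embed(text):
    # Placeholder: a real system calls an embedding model here
    # (e.g. an API or a local encoder) and returns a dense vector.
    return [float(hash(text) % 1000)]

# Build the index: chunk each document, embed each chunk,
# store vector + original text together.
index = []
doc_tokens = ["tok%d" % i for i in range(1200)]
for chunk in chunk_text(doc_tokens):
    text = " ".join(chunk)
    index.append({"embedding": embed(text), "text": text})
```

The overlap between consecutive chunks is a common mitigation for facts that straddle a chunk boundary.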
Retrieval: when a query arrives, it is embedded using the same model. The vector database executes an approximate nearest-neighbor (ANN) search, returning the top-k most semantically similar chunks (typically k=5 to 20). These chunks are the candidates for grounding the response.
Generation: the retrieved chunks are inserted into the LLM prompt alongside the query. The prompt instructs the model to answer based on the provided context. The model synthesizes the context rather than generating from its parametric memory.
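The retrieval and generation stages together can be sketched as follows. Assumptions: exact nearest-neighbor search stands in for ANN (production systems use HNSW or IVF indexes), and the prompt template is one plausible formulation, not a canonical one.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=5):
    """Exact top-k search; a vector database would use ANN instead."""
    scored = sorted(index,
                    key=lambda e: cosine(query_vec, e["embedding"]),
                    reverse=True)
    return scored[:k]

def build_prompt(query, chunks):
    """Insert retrieved chunks as grounding context ahead of the question."""
    context = "\n\n".join(c["text"] for c in chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The explicit "using only the context below" instruction is what shifts the model from recall to reading.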
Chunking Strategy
The chunking strategy significantly affects retrieval quality. Fixed-size chunking (e.g., every 512 tokens) is simple but may split logical units mid-sentence. Sentence-boundary chunking preserves semantic coherence but produces variable-size chunks. Semantic chunking uses embeddings to detect topic boundaries and splits on them. Chunk size also matters: small chunks (128 tokens) capture specific facts with high precision but may lose surrounding context. Large chunks (1024 tokens) preserve context but reduce precision and fill the LLM's context window faster.
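Sentence-boundary chunking can be sketched as a greedy packer: split on sentence-ending punctuation, then pack whole sentences into chunks up to a token budget. The regex and the whitespace token count are crude stand-ins for a real sentence splitter and tokenizer.

```python
import re

def sentence_chunks(text, max_tokens=256):
    """Greedy sentence-boundary chunking: pack whole sentences into
    chunks up to a token budget, so no chunk splits mid-sentence."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # crude token count; real systems tokenize
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Unlike fixed-size chunking, every chunk here ends at a sentence boundary, at the cost of variable chunk sizes.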
Reranking
The ANN search returns the top-k chunks by embedding similarity. Embedding similarity is a blunt instrument: it captures semantic closeness but not precise relevance to the specific query intent. A cross-encoder reranker takes each (query, chunk) pair and scores their relevance jointly, using full attention across both texts. This is more accurate than embedding dot products but more expensive (O(k) cross-encoder inference passes). A typical pipeline retrieves 20 candidates via ANN and reranks to the top 5 for the generation stage.
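The rerank step itself is simple once a pairwise scorer exists. In the sketch below, score_fn stands in for a real cross-encoder forward pass over the concatenated (query, chunk) pair; the lexical-overlap scorer is an illustration only, not a substitute for a trained model.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Score each (query, chunk) pair jointly and keep the top_n.
    score_fn is a stand-in for a cross-encoder inference pass."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def overlap_score(query, chunk):
    # Illustration only: fraction of query terms present in the chunk.
    # A real reranker runs full attention across both texts.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)
```

The O(k) cost is visible in the structure: one score_fn call per candidate, which is why reranking is applied to 20 candidates rather than the whole corpus.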
Hybrid Search
Dense retrieval (embedding-based) excels at semantic similarity. Sparse retrieval (BM25, keyword-based) excels at exact term matching. A query for "BERT model paper 2018" benefits from keyword matching on the exact terms, not just semantic similarity. Hybrid search combines both: results from dense and sparse retrieval are merged using reciprocal rank fusion or a learned scorer. Production RAG systems with demanding retrieval-quality requirements typically use hybrid search.
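Reciprocal rank fusion is compact enough to show in full: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, so documents ranked well by both retrievers rise to the top. The constant k=60 is the value proposed in the original RRF paper.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document ids. Each document scores
    sum(1 / (k + rank)) over the lists it appears in; higher is better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF uses only ranks, not raw scores, it needs no calibration between the dense similarity scale and the BM25 score scale, which is why it is a common default fusion method.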
Failure Mode Taxonomy
Retrieval failure: the correct chunks are not in the top-k. Causes include: embedding model domain mismatch (a general model on a specialized corpus), poor chunking (relevant context split across chunk boundaries), or insufficient corpus coverage. Diagnosis: evaluate retrieval quality separately with a labeled evaluation set using recall@k metrics.
Generation failure: the correct chunks are retrieved but the model ignores or misinterprets them. Causes include: prompt design issues (context buried or poorly formatted), context window overflow (too many chunks dilute the relevant ones), or the model's parametric knowledge overriding the retrieved context. Diagnosis: inject the gold context directly into the prompt and test whether the model answers correctly. If yes, the retrieval is failing; if no, the generation prompt is failing.
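The diagnostic procedure above can be expressed as a small harness. All the function names here are hypothetical stand-ins: retrieve_fn and answer_fn wrap the real pipeline's retrieval and LLM call, and contains_answer is a task-specific check for the gold answer.

```python
def diagnose(question, gold_chunk, answer_fn, retrieve_fn, contains_answer):
    """Separate retrieval failures from generation failures on one example.
    answer_fn(prompt) and retrieve_fn(question) wrap the real pipeline."""
    retrieved = retrieve_fn(question)
    retrieval_ok = gold_chunk in retrieved
    # Bypass retrieval entirely: inject the gold chunk directly.
    forced = answer_fn(f"Context:\n{gold_chunk}\n\nQuestion: {question}")
    generation_ok = contains_answer(forced)
    if retrieval_ok and generation_ok:
        return "pipeline healthy on this example"
    if not retrieval_ok and generation_ok:
        return "retrieval failure: fix chunking, embeddings, or coverage"
    return "generation failure: fix prompt design or context management"
```

Run over a labeled evaluation set, this cleanly attributes each wrong answer to one stage before any fix is attempted.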
Interview Tip
The diagnostic question is the signature of a strong answer: "Your RAG pipeline has high retrieval recall but users still get wrong answers. How do you investigate?" The expected structure: confirm retrieval quality with labeled evaluation (is the right chunk in the top-k?), then isolate generation by testing with the gold chunk injected directly. If generation fails even with correct context, the problem is prompt design or context window management. This systematic decomposition (evaluating retrieval quality independently of generation quality) is what L5+ candidates demonstrate. Candidates who jump to "fine-tune the LLM" have skipped three diagnostic steps.
Related Concepts
Vector database: a database purpose-built to store and query high-dimensional embedding vectors. The retrieval layer that makes semantic search and RAG pipelines possible at production scale.
Embeddings: numerical vector representations of text, images, or other data that encode semantic meaning. The translation layer that converts unstructured content into a form that can be compared mathematically.
Attention: a neural network operation that allows each element in a sequence to selectively weight every other element when computing its representation, enabling context-aware processing across arbitrary distances.