AI Systems Architecture

Designing a RAG Pipeline at Scale

End-to-end design of a production Retrieval-Augmented Generation system: chunking strategies, embedding models, vector DB selection, reranking, and the failure modes that surface at 10M+ documents.

The Problem RAG Solves, and Where It Fails

Fine-tuned models memorise knowledge at training time. RAG retrieves it at inference time. The distinction matters because knowledge freshness is a correctness requirement in most production systems. A fine-tuned model trained six months ago doesn't know what changed last week. RAG does, provided the retrieval pipeline is accurate.

The failure mode nobody talks about in design interviews: retrieval quality degrades predictably as corpus size grows. A pipeline that achieves recall@10 of 0.91 at 100k documents can drop to 0.74 at 10M. That's not a vector database problem. It's an indexing and chunking problem. Getting this right is the difference between a demo that works and a system that holds up under real query distributions.

The Core Problem

Naive RAG implementations retrieve based on embedding similarity alone. At small scale, this works. At production scale, two failure modes dominate:

Semantic drift at chunk boundaries. If you chunk documents by fixed token count (the most common approach), a chunk boundary often falls mid-sentence or mid-concept. The embedding of an incomplete thought is a poor representation. Retrieval pulls the wrong chunks because the embeddings don't capture the actual semantic unit.

Recall collapse on long-tail queries. Common queries map cleanly to high-density regions of the embedding space. Rare or multi-hop queries don't. They retrieve tangentially related content that looks similar but doesn't answer the question.

Architecture

A production RAG pipeline has four stages: ingestion, retrieval, reranking, and generation. Most engineers over-engineer generation and under-engineer ingestion.

Ingestion

Documents land in blob storage (S3/GCS). An ingestion worker:

  1. Extracts text (PDF parsing, HTML stripping)
  2. Chunks with overlap: 256–512 tokens per chunk, 10–20% overlap to avoid boundary artifacts
  3. Embeds each chunk via a batch embedding API (OpenAI text-embedding-3-large, Cohere embed-v3, or a self-hosted model)
  4. Upserts the vector + metadata into the vector store

The chunking strategy is the most consequential decision here. For structured documents (legal, technical), semantic chunking (splitting at paragraph or section boundaries) outperforms fixed-token chunking by 10–15% on recall@10. For unstructured prose, recursive character splitting with overlap is the pragmatic default.
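A minimal sketch of the sliding-window chunker described above. Whitespace-split words stand in for real tokenizer output here; in production you would count tokens with the embedding model's own tokenizer. The chunk size and overlap defaults are illustrative values from the ranges above, not recommendations:

```python
def chunk_with_overlap(tokens, chunk_size=384, overlap=64):
    """Fixed-size chunking with a sliding window.

    tokens: any sequence (words stand in for tokenizer output).
    Consecutive chunks share `overlap` tokens, so a sentence cut at
    one chunk boundary appears whole in the next chunk.
    """
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Usage: words stand in for tokens from the embedding model's tokenizer.
words = ("RAG retrieves knowledge at inference time. " * 200).split()
chunks = chunk_with_overlap(words, chunk_size=256, overlap=32)
```

Semantic chunking replaces the fixed `step` with paragraph or section boundaries; the overlap logic stays the same.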

Retrieval

At query time, embed the user query with the same model used during ingestion (mismatched models produce incomparable vectors). Run an ANN search against the vector store. Return the top-k chunks (k = 20 is a reasonable starting point; tune based on reranker performance).
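The retrieval step can be sketched as brute-force cosine scoring over an in-memory index. This is a stand-in for the ANN search (HNSW, IVF) a real vector store performs; the interface — query vector in, top-k `(chunk_id, score)` pairs out — is the same:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, index, k=20):
    """index: list of (chunk_id, vector) pairs, embedded with the SAME
    model as the query. Brute-force scoring stands in for ANN search."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In production the index lives in the vector store and `top_k` is a single query call, but the contract is identical.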

Hybrid retrieval (combining dense vector and sparse BM25/keyword search) improves recall@10 by 5–12% on most benchmarks. Pinecone supports this natively. For self-hosted setups, Weaviate and Elasticsearch can run BM25 alongside vector search.
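The text doesn't specify how the dense and sparse result lists are merged; Reciprocal Rank Fusion (RRF) is one common choice, and managed stores that support hybrid search typically do something equivalent internally. A sketch:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over any number of ranked id lists
    (e.g. one from dense vector search, one from BM25).

    An id contributes 1/(k + rank) for each list it appears in;
    k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalisation, which matters because cosine similarities and BM25 scores live on incomparable scales.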

Reranking

The vector search returns k candidates. A cross-encoder reranker (Cohere Rerank, cross-encoder/ms-marco-MiniLM-L-6-v2) scores each candidate against the query and re-sorts. Cross-encoders are slower than bi-encoders (they process query+document pairs, not independent vectors) but dramatically more accurate. Expect recall@5 to improve from ~0.78 to ~0.91 after reranking.

Latency cost: ~50–150ms for a cross-encoder over 20 candidates. Budget this into your P99 target.
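The reranking stage reduces to: score every (query, candidate) pair jointly, re-sort, keep the best. The sketch below uses a toy term-overlap scorer purely so it runs self-contained; in production `score_fn` would be a call into Cohere Rerank or a loaded cross-encoder model:

```python
def rerank(query, candidates, score_fn, top_n=5):
    """candidates: list of (chunk_id, text) from the vector search.
    score_fn scores a (query, document) pair jointly, as a
    cross-encoder does; keep the top_n highest-scoring candidates."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c[1]),
                    reverse=True)
    return scored[:top_n]

def term_overlap(query, doc):
    # Toy stand-in for a cross-encoder forward pass: fraction of
    # query terms present in the document.
    q_terms = set(query.lower().split())
    return len(q_terms & set(doc.lower().split())) / max(len(q_terms), 1)
```

Note the structural difference from retrieval: the scorer sees both texts at once, which is why it cannot be precomputed into an index.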

Generation

Pass the top-3 to top-5 reranked chunks as context to the LLM. Prompt engineering here is implementation-specific. The two things that matter at scale:

Context window management. At large k values, you can exceed the model's context window. Truncate at the retrieval stage, not by silently dropping content mid-chunk.
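A minimal sketch of that truncation policy: keep whole chunks in reranked order until the token budget would be exceeded, and drop the rest rather than slicing a chunk mid-thought:

```python
def pack_context(ranked_chunks, budget):
    """ranked_chunks: list of (text, token_count) in reranked order.
    Stops at the first chunk that would exceed the budget, so rank
    order is preserved and no chunk is ever partially included."""
    packed, used = [], 0
    for text, n_tokens in ranked_chunks:
        if used + n_tokens > budget:
            break
        packed.append(text)
        used += n_tokens
    return packed
```

Stopping at the first overflow (rather than skipping ahead to smaller chunks) is a deliberate choice: it keeps the highest-ranked evidence and avoids reordering what the reranker produced.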

Citation tracking. Store document_id and chunk_id alongside each vector. Return them with the generated response. This enables source attribution without additional retrieval.
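The metadata shape this implies can be sketched as a small record type carried from ingestion through retrieval (the field names here are illustrative, matching the `document_id`/`chunk_id` convention above):

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    document_id: str   # stored alongside the vector at ingestion time
    chunk_id: str
    text: str
    score: float

def citations(chunks):
    """Deduplicated (document_id, chunk_id) pairs in rank order, to
    return with the generated answer for source attribution."""
    seen, out = set(), []
    for c in chunks:
        key = (c.document_id, c.chunk_id)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out
```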

Trade-offs

Managed vs self-hosted vector store. Pinecone eliminates operational overhead and gives you production-grade uptime guarantees, but at ~$0.10/1M vectors/month at scale, costs compound. pgvector on Postgres is free and sufficient up to ~1M vectors with HNSW indexing. Beyond 5M vectors, Pinecone or Weaviate cloud outperform pgvector on P99 query latency.

Embedding model choice. OpenAI text-embedding-3-large achieves state-of-the-art recall on MTEB benchmarks but costs $0.13/1M tokens. At 10M document chunks of 400 tokens average, that's ~$520 for initial ingestion. For high-volume pipelines, Cohere's batched API or a self-hosted BGE model cuts this by 60–80%.
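The arithmetic behind that figure is worth keeping as a one-liner you can rerun when prices or corpus size change:

```python
def embedding_cost_usd(num_chunks, avg_tokens_per_chunk,
                       price_per_million_tokens):
    """Total one-time embedding cost for ingesting a corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens

# The figure from the text: 10M chunks averaging 400 tokens
# at $0.13 per 1M tokens.
cost = embedding_cost_usd(10_000_000, 400, 0.13)  # ≈ $520
```

The same function prices a full re-ingestion after an embedding model switch, which is the migration cost flagged below.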

Production Considerations

Index freshness. Document updates require re-embedding and re-indexing the affected chunks. At high ingest rates, this creates write pressure on the vector store. Batch updates hourly rather than per-document unless real-time freshness is a hard requirement.

Embedding drift. If you switch embedding models, existing vectors are incompatible with new vectors. A full re-ingestion is required. Pin your embedding model version and plan for migration costs before going to production.

Monitoring. Instrument retrieval quality with offline evals (RAGAS, TruLens) against a golden dataset of 100–200 representative queries. Run evals after every chunking or retrieval strategy change. Retrieval quality is not visible in production logs. You must measure it explicitly.
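The core metric for those offline evals is mean recall@k over the golden dataset; frameworks like RAGAS compute richer variants, but the basic measurement is a few lines:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def eval_golden_set(golden, retrieve_fn, k=10):
    """golden: list of (query, relevant_ids) pairs. retrieve_fn maps a
    query to a ranked list of chunk ids. Returns mean recall@k."""
    scores = [recall_at_k(retrieve_fn(query), relevant, k)
              for query, relevant in golden]
    return sum(scores) / len(scores)
```

Run this before and after every chunking or retrieval change; a drop here is invisible in production logs but shows up immediately against the golden set.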

Further Reading

  • RAGAS: Automated Evaluation of RAG Pipelines
  • Pinecone Learning Center: hybrid search documentation
  • Designing Data-Intensive Applications, Chapter 3: storage engine internals (relevant for understanding vector index structures)