
LLM Inference Optimization: KV-Cache, Continuous Batching & Quantization

How to reduce LLM serving cost and latency by 5–10x. Covers KV-cache mechanics, continuous batching with PagedAttention, GPTQ/AWQ quantization, and speculative decoding, with numbers.

Why LLM Inference Is Expensive

A single forward pass through a 70B parameter model at fp16 requires ~140GB of GPU memory just for weights. Add the KV-cache for a 4096-token context and you're at ~155GB, which is beyond a single A100 (80GB). You need tensor parallelism across two GPUs before you serve a single request.
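The arithmetic behind those numbers is worth sketching. This is a back-of-envelope estimate, not a measurement: the layer and head counts are assumed Llama-70B-style values, and the KV figure ignores activation workspace.

```python
# Back-of-envelope serving memory for a 70B model at fp16 (MHA assumed).
params = 70e9
weights_gb = params * 2 / 1e9                  # 2 bytes per weight at fp16

# Per-token KV-cache: 2 (K and V) * layers * heads * head_dim * 2 bytes
layers, heads, head_dim = 80, 64, 128          # assumed 70B-class shape
kv_gb_per_token = 2 * layers * heads * head_dim * 2 / 1e9
total_gb = weights_gb + 4096 * kv_gb_per_token

print(f"weights {weights_gb:.0f} GB, with 4096-token KV ~{total_gb:.0f} GB")
# → weights 140 GB, with 4096-token KV ~151 GB
```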

At production throughput, the cost structure is dominated by two factors: GPU memory bandwidth (inference is memory-bandwidth-bound, not compute-bound, for most models) and batch utilisation. Optimising inference means attacking both.

KV-Cache: What It Is and Why It Gets Wasted

During autoregressive generation, the transformer computes key and value matrices for every token in the context at every layer. For a 100-token prefix, that's 100 × num_layers key and value matrices, and without caching all of them are recomputed on every forward pass after the first, even though they never change.

The KV-cache stores these matrices after the first computation and reuses them for subsequent tokens. Without the cache, a 500-token generation recomputes ~40,000 per-layer key/value pairs on every step (500 tokens × 80 layers for a 70B-class model). With caching, you compute once for the prompt and once per generated token.
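The mechanics can be sketched with a toy single-layer, single-head decode loop. The projection matrices and token embeddings are random stand-ins; the point is that each K and V row is computed once and appended, never recomputed.

```python
import numpy as np

def attention(q, K, V):
    scores = K @ q / np.sqrt(q.shape[0])          # (t,) attention logits
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                                  # weighted sum of values

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((6, d))              # prompt + generated embeddings

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for x in tokens:
    K_cache = np.vstack([K_cache, (Wk @ x)[None, :]])   # one new row per step
    V_cache = np.vstack([V_cache, (Wv @ x)[None, :]])
    outputs.append(attention(Wq @ x, K_cache, V_cache))

# The cached result matches recomputing all K/V from scratch at the last step
K_full, V_full = tokens @ Wk.T, tokens @ Wv.T
assert np.allclose(outputs[-1], attention(Wq @ tokens[-1], K_full, V_full))
```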

The problem: naive KV-cache implementations pre-allocate contiguous GPU memory for the maximum sequence length. In practice, requests vary widely in length. A 500-token request in a batch with a 4096-token maximum wastes ~88% of its KV-cache allocation. On an 80GB A100, this limits real concurrency to 8–12 simultaneous requests, well below what the hardware can theoretically handle.
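The waste fraction is simple to compute. The per-token KV size here assumes an 80-layer, 64-head, 128-dim fp16 MHA model as a stand-in for a 70B-class architecture; the numbers are illustrative, not measured.

```python
# KV-cache waste under contiguous max-length pre-allocation. Per-token
# size assumes an 80-layer, 64-head, 128-dim fp16 MHA model (illustrative).
kv_mib_per_token = 2 * 80 * 64 * 128 * 2 / 2**20    # K and V, 2 bytes each

max_len, actual_len = 4096, 500
allocated_gib = max_len * kv_mib_per_token / 1024
used_gib = actual_len * kv_mib_per_token / 1024
print(f"allocated {allocated_gib:.1f} GiB, used {used_gib:.1f} GiB, "
      f"wasted {1 - actual_len / max_len:.0%}")
# → allocated 10.0 GiB, used 1.2 GiB, wasted 88%
```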

PagedAttention and vLLM

vLLM's PagedAttention addresses the memory waste problem. Instead of contiguous pre-allocation, it manages KV-cache in fixed-size pages (blocks of tokens), allocated dynamically as generation proceeds. The physical GPU memory layout becomes non-contiguous; a translation layer maps logical sequence positions to physical block addresses.
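A toy block table shows the idea. This is illustrative only, not vLLM's actual data structures: physical blocks come from a free list, and a per-sequence table maps logical positions to (block, offset) pairs.

```python
# Toy block-table sketch of paged KV-cache allocation (not vLLM's code).
BLOCK_SIZE = 16  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.block_tables = {}                    # seq_id -> [physical block ids]

    def append_token(self, seq_id, pos):
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                 # crossed a block boundary
            table.append(self.free.pop())         # allocate on demand
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id):                    # blocks return to the pool
        self.free.extend(self.block_tables.pop(seq_id))

cache = PagedKVCache(num_physical_blocks=64)
for pos in range(40):                             # a 40-token sequence
    cache.append_token("req-0", pos)
print(len(cache.block_tables["req-0"]))           # → 3 blocks, not a 4096-token reservation
```

Because blocks are released as soon as a sequence finishes, short requests no longer hold a maximum-length reservation hostage.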

The result: near-zero memory fragmentation, ~2–4x higher throughput on the same hardware compared to naive implementations. At 100 QPS on a 13B model, PagedAttention typically doubles the number of concurrent requests that fit in GPU memory.

vLLM, TGI (Text Generation Inference by Hugging Face), and TensorRT-LLM all implement PagedAttention or equivalent block-based KV-cache management.

Continuous Batching

Static batching holds a batch together until its longest sequence completes. This wastes GPU cycles: when a short sequence finishes early, its slot sits idle until the rest of the batch is done.

Continuous batching (also called iteration-level scheduling) replaces completed sequences with new requests at each token generation step. The batch size remains roughly constant; finished sequences leave and new sequences enter after each iteration.

In practice: continuous batching improves GPU utilisation from ~40–60% (static batching) to ~75–90% under variable request lengths. For serving, this translates directly to throughput and cost.
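A toy simulation makes the difference concrete. The request lengths and batch cap are made up, and one "step" stands in for decoding one token for every sequence currently in the batch.

```python
import random
from collections import deque

# Toy comparison of static vs. continuous (iteration-level) batching.
random.seed(0)
MAX_BATCH = 4
reqs = [random.randint(5, 20) for _ in range(10)]   # tokens each request needs

# Static: a batch occupies its slots until the longest member finishes
static_steps = sum(max(reqs[i:i + MAX_BATCH])
                   for i in range(0, len(reqs), MAX_BATCH))

# Continuous: finished sequences leave and waiting ones enter every step
waiting, running, cont_steps = deque(reqs), [], 0
while waiting or running:
    while waiting and len(running) < MAX_BATCH:      # admit new requests
        running.append(waiting.popleft())
    running = [r - 1 for r in running if r > 1]      # decode one token; drop finished
    cont_steps += 1

print(f"static: {static_steps} steps, continuous: {cont_steps} steps")
```

Continuous batching never takes more steps than static batching here, and the gap widens as request lengths get more variable.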

Quantization

Quantization reduces weight precision from fp16 (2 bytes/weight) to int8 or int4 (1 or 0.5 bytes/weight). A 70B model at fp16 requires ~140GB; at int4, ~35GB, fitting on a single A100 with room for KV-cache.
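The storage format itself is worth seeing before the calibration-based schemes below. This is a minimal round-to-nearest int4 sketch with per-group scales; the grouping and scale layout are typical of int4 schemes generally, not GPTQ's or AWQ's actual algorithms.

```python
import numpy as np

# Round-to-nearest int4 quantization with per-group scales.
# A storage-format sketch only: GPTQ/AWQ add calibration on top of this.
rng = np.random.default_rng(0)
w = rng.standard_normal(128).astype(np.float32)   # one weight row
GROUP = 32                                        # common int4 group size

groups = w.reshape(-1, GROUP)
scales = np.abs(groups).max(axis=1, keepdims=True) / 7   # int4 range is [-8, 7]
q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
w_hat = (q * scales).reshape(-1)                  # dequantized at inference time

# Stored size: 4 bits/weight plus one fp16 scale per 32 weights
bits_per_weight = 4 + 16 / GROUP
print(f"max abs error {np.abs(w - w_hat).max():.3f}, {bits_per_weight} bits/weight")
```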

GPTQ (Post-Training Quantization): One-shot quantization calibrated on a small dataset. Int4 GPTQ models lose ~1–3 perplexity points versus fp16 on standard benchmarks, which is acceptable for most production use cases. Inference is faster because memory bandwidth requirements drop by 4x.

AWQ (Activation-aware Weight Quantization): Identifies weight channels that are sensitive to quantization (based on activation magnitudes) and protects them at higher precision. AWQ int4 outperforms GPTQ int4 by ~0.5–1 perplexity points on most benchmarks.

Tradeoff to know: quantization reduces memory pressure and increases throughput but adds ~2–5ms latency per token for dequantization on some hardware. On H100s with native int8 tensor cores, this overhead is negligible.

Speculative Decoding

Autoregressive generation is sequential: each token depends on the previous one. Speculative decoding breaks this by using a small "draft" model (e.g., 7B) to propose k tokens ahead, then verifying all k tokens in a single forward pass of the large "target" model (e.g., 70B).

If the target model accepts all k draft tokens, you've produced k tokens for roughly the cost of one. If some are rejected, you discard from the first rejection and continue normally.

Expected speedup: 2–3x on tasks where the draft model is usually correct (translation, summarization, code completion with clear patterns). Near-zero benefit on tasks requiring complex reasoning where draft tokens are frequently rejected.
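The acceptance loop can be sketched with stand-in models over integer tokens (neither function is a real LLM). This is the greedy variant: the target keeps the longest draft prefix that matches its own greedy choices, substitutes its own token at the first mismatch, and gains a free token on a full accept.

```python
# Toy greedy speculative decoding with stand-in "models".
def target_greedy(prefix):
    return (prefix[-1] + 1) % 10        # pretend target model: counts mod 10

def speculative_step(prefix, draft_tokens):
    accepted = []
    for tok in draft_tokens:
        expected = target_greedy(prefix + accepted)   # batched into one pass in practice
        if tok != expected:
            accepted.append(expected)   # target's token replaces the first rejection
            break
        accepted.append(tok)
    else:
        accepted.append(target_greedy(prefix + accepted))  # free token on full accept
    return accepted

print(speculative_step([3], [4, 5, 9, 6]))   # → [4, 5, 6]: right twice, wrong on the third
```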

Tensor Parallelism and Pipeline Parallelism

When a model doesn't fit on a single GPU, you need model parallelism.

Tensor parallelism splits individual weight matrices across GPUs. Each GPU computes a shard of each matrix multiplication, then synchronises via all-reduce. Scales well to 8 GPUs; beyond that, all-reduce communication overhead dominates. Megatron-LM and vLLM both implement this.
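The column-parallel case is easy to demonstrate in numpy. Each "GPU" owns a slice of the weight matrix's output columns, computes its partial result independently, and the concatenation stands in for the communication step that an all-gather performs on real hardware; shapes are made up for illustration.

```python
import numpy as np

# Column-parallel linear layer: shard W's output columns across ranks.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))               # activations: (batch, d_in)
W = rng.standard_normal((16, 8))               # weights: (d_in, d_out)

world_size = 2
shards = np.split(W, world_size, axis=1)       # each rank holds d_out/2 columns
partials = [x @ shard for shard in shards]     # computed independently per rank
y = np.concatenate(partials, axis=1)           # the synchronisation point

assert np.allclose(y, x @ W)                   # identical to the unsharded matmul
```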

Pipeline parallelism assigns different transformer layers to different GPUs. Each GPU processes its layers sequentially, then passes activations to the next GPU in the pipeline. Communication overhead is lower (only activations cross GPU boundaries, not weight gradients), but pipeline bubbles (idle time waiting for the previous stage) reduce utilisation. Best for large models across many GPUs where tensor parallelism communication becomes a bottleneck.
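The bubble cost follows from a two-line schedule model, assuming equal per-stage time and ignoring communication: with s stages and m microbatches, the last microbatch exits after m + s − 1 ticks, and the fill/drain bubble is the fraction of stage-time not doing useful work.

```python
# Pipeline-bubble arithmetic for s equal-cost stages and m microbatches.
def pipeline_ticks(stages: int, microbatches: int) -> int:
    return microbatches + stages - 1

for m in (1, 4, 16):
    total = pipeline_ticks(4, m)
    bubble = 1 - m / total                     # (s - 1) / (m + s - 1)
    print(f"{m:2d} microbatches: {total:2d} ticks, bubble {bubble:.0%}")
# →  1 microbatches:  4 ticks, bubble 75%
#    4 microbatches:  7 ticks, bubble 43%
#   16 microbatches: 19 ticks, bubble 16%
```

More microbatches shrink the bubble, which is why pipeline schedules slice the batch finely.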

Production Numbers

On a single A100 80GB with vLLM and PagedAttention:

  • Llama 3.1 70B int4: ~18 tokens/second at batch size 1, ~120 tokens/second at batch size 8
  • Serving cost: ~$2.50/hour on AWS, ~$0.002/1k tokens at moderate throughput

On two A100s with tensor parallelism + fp16:

  • Llama 3.1 70B fp16: ~22 tokens/second at batch size 1, ~95 tokens/second at batch size 4

For most production deployments, int4 AWQ + vLLM on a single A100 is the cost-optimal configuration unless you have strict quality requirements that quantization violates.

Further Reading