Rate Limiting
Controls the rate of requests to a service. Common algorithms: Token Bucket, Leaky Bucket, Fixed Window, and Sliding Window.
Why Rate Limiting Exists
Without rate limiting, a single misbehaving client can exhaust a service's capacity and degrade it for every other client. The failure modes are concrete:
- A poorly written script retries a failed API call in a tight loop, sending 10,000 requests per second instead of the intended 10.
- A downstream service triggers a retry storm after a 500 error, multiplying load by 50x at exactly the moment the service is struggling.
- A scraper runs unconstrained and consumes 80% of your database read capacity, raising P99 latency from 20ms to 800ms for paying customers.
Rate limiting is the enforcement layer that bounds the request rate any single client, IP, or token can sustain. It does not eliminate abuse, but it contains the blast radius.
Algorithm 1: Token Bucket
A bucket holds a maximum of N tokens. Tokens replenish at a fixed rate R (e.g., 100 tokens per second). Each request consumes one token. If the bucket is empty, the request is rejected (or queued, depending on implementation).
The key property: the bucket allows bursting. A client that has not made any requests for 10 seconds accumulates up to N tokens and can then fire N requests instantaneously. This matches real traffic patterns, where legitimate clients send occasional bursts.
Example: N=500 tokens, R=100 tokens/second. A client can burst 500 requests at once, then is limited to 100 requests/second thereafter.
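A minimal in-process sketch of the scheme (the class name and the injectable clock are illustrative, not a standard API):

```python
import time

class TokenBucket:
    """Token bucket: capacity N, refilled at R tokens/second."""
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = float(capacity)  # bucket starts full
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Replenish tokens for elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False
```

An idle client's bucket fills back to capacity, which is exactly what permits the burst-of-N behaviour described above.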
Token bucket is the algorithm behind most API gateway rate limiters: AWS API Gateway, Stripe's API, and GitHub's API all use variants of it.
Algorithm 2: Leaky Bucket
Incoming requests enter a FIFO queue (the "bucket"). Requests are processed (leak) at a fixed rate regardless of how fast they arrive. If the queue is full, excess requests are dropped.
The key property: the output rate is constant. There is no bursting. This is the right choice when downstream systems need a stable, predictable call rate, such as a rate-limited third-party API that will cut you off if you exceed 50 requests/second even momentarily.
The cost: added latency. Requests queue up rather than getting an immediate rejection. Queue depth becomes a latency variable.
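A sketch of the queue-and-drain behaviour, assuming whole requests leak out at a fixed rate (names and the injectable clock are illustrative):

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: requests queue up and drain at a fixed rate."""
    def __init__(self, queue_size, leak_rate, clock=time.monotonic):
        self.queue = deque()
        self.queue_size = queue_size
        self.leak_rate = leak_rate  # requests processed per second
        self.clock = clock
        self.last_leak = clock()

    def _leak(self):
        now = self.clock()
        # Drain however many whole requests have leaked since the last check.
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def submit(self, request):
        self._leak()
        if len(self.queue) >= self.queue_size:
            return False  # bucket full: drop the request
        self.queue.append(request)
        return True
```

Note the latency trade-off in code form: a `submit` that returns True has been queued, not processed, so the caller's wait grows with queue depth.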
Algorithm 3: Fixed Window Counter
Divide time into fixed windows of duration T (e.g., 60 seconds). Count requests per client per window. Reject requests once the count exceeds the limit.
Simple to implement with a single Redis INCR and a TTL set to T. The critical flaw: the boundary spike problem. If the limit is 100 requests per minute and a client sends 100 requests at 11:59:59 and 100 requests at 12:00:01, they send 200 requests in a 2-second window while technically obeying the per-minute limit. At the boundary, the effective limit doubles.
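The Redis INCR-plus-TTL pattern, simulated in-process with a dict so the logic is visible (keys and names are illustrative):

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Fixed window counter: at most `limit` requests per `window` seconds.
    In production this is one Redis INCR plus EXPIRE; here, an in-process dict."""
    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.counts = defaultdict(int)  # (client, window_index) -> count

    def allow(self, client):
        # All requests in the same window share one counter key.
        window_index = int(self.clock() // self.window)
        key = (client, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

The boundary spike falls straight out of this code: two adjacent `window_index` values are independent counters, so a client can exhaust both within seconds of each other.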
Algorithm 4: Sliding Window Log
Store a timestamp for every request in a sorted set (Redis sorted set keyed by client ID works well). On each request, remove all timestamps older than the window duration, count remaining entries, and reject if over limit.
Perfect accuracy: no boundary spike. The cost is memory. Storing a timestamp per request means 1 million requests per minute across your clients requires millions of sorted-set entries. At scale, the memory footprint becomes the constraint.
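A sketch using an in-process deque of timestamps per client in place of the Redis sorted set (names are illustrative):

```python
import time
from collections import defaultdict, deque

class SlidingWindowLog:
    """Sliding window log: one stored timestamp per accepted request."""
    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.logs = defaultdict(deque)  # client -> request timestamps

    def allow(self, client):
        now = self.clock()
        log = self.logs[client]
        # Evict timestamps that have fallen out of the window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

The memory cost is visible in the data structure: `logs` holds one entry per accepted request still inside the window, for every client.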
Algorithm 5: Sliding Window Counter
The production-grade hybrid. Maintain two fixed-window counters: the current window and the previous window. Estimate the count in the sliding window as:
current_count + previous_count * (1 - elapsed_fraction_of_current_window)
Example: current window has 40 requests. Previous window had 80 requests. 25% of the current window has elapsed. Estimated sliding window count: 40 + 80 * 0.75 = 100.
This approximation has an error rate under 1% in empirical testing (Cloudflare published their analysis). Memory cost: two integers per client per window. This is the algorithm Cloudflare uses for its rate limiting product.
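A sketch of the two-counter hybrid, reproducing the worked example above (names are illustrative):

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Two counters per client (current and previous window); estimate the
    sliding-window count as current + previous * (1 - elapsed_fraction)."""
    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.counts = defaultdict(int)  # (client, window_index) -> count

    def allow(self, client):
        now = self.clock()
        idx = int(now // self.window)
        elapsed = (now % self.window) / self.window
        current = self.counts[(client, idx)]
        previous = self.counts[(client, idx - 1)]
        # Weight the previous window by the share of it still inside
        # the sliding window.
        estimate = current + previous * (1 - elapsed)
        if estimate >= self.limit:
            return False
        self.counts[(client, idx)] += 1
        return True
```

With limit 100, current count 40, previous count 80, and 25% of the window elapsed, the estimate is exactly 100 and the request is rejected, matching the example.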
Algorithm Comparison
| Algorithm | Burst Handling | Boundary Spike | Memory Cost | Accuracy |
|---|---|---|---|---|
| Token Bucket | Yes (up to N) | No | Low (1 value) | Exact |
| Leaky Bucket | No (smoothed) | No | Medium (queue) | Exact |
| Fixed Window | Yes | Yes (2x limit) | Very low | Approximate |
| Sliding Window Log | Yes | No | High (1 entry/req) | Exact |
| Sliding Window Counter | Yes | Minimal | Very low (2 values) | ~99% |
Distributed Rate Limiting
A single-server rate limiter is straightforward: increment an in-memory counter. A distributed API with 20 gateway nodes is harder. If each node maintains its own counter, a client can send 100 requests to each of 20 nodes and bypass a 100 request/minute global limit entirely.
The standard solution: centralised counter in Redis.
A Lua script, submitted with Redis's EVAL command, executes atomically: check the counter, increment if under the limit, reject if over. Because Lua scripts run without interruption in Redis's single-threaded command loop, there is no race condition between the check and the increment. The round-trip to Redis adds ~0.5–2ms per request depending on network topology.
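A sketch of what such a script can look like, with the check-and-increment logic also modelled in pure Python. The Lua body is illustrative (key layout and argument order are assumptions, not a specific library's script):

```python
# Illustrative fixed-window check-and-increment, run via EVAL.
# KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = window TTL in seconds.
RATE_LIMIT_LUA = """
local count = tonumber(redis.call('GET', KEYS[1]) or '0')
if count >= tonumber(ARGV[1]) then
    return 0
end
redis.call('INCR', KEYS[1])
if count == 0 then
    redis.call('EXPIRE', KEYS[1], tonumber(ARGV[2]))
end
return 1
"""

def check_and_increment(store, key, limit):
    """Pure-Python model of the script's logic. In Redis the whole function
    body runs as one uninterrupted step, so two gateway nodes can never both
    pass the check when the counter sits at limit - 1."""
    count = store.get(key, 0)
    if count >= limit:
        return False
    store[key] = count + 1
    return True
```

A naive client-side GET followed by INCR would allow two nodes to read the same count and both increment; pushing both operations into one script is what closes that window.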
For very high-throughput APIs (above 100,000 requests/second), even Redis becomes a bottleneck. The mitigation is local token buckets with periodic synchronisation: each node tracks a local bucket and syncs its consumed token count to Redis every 100ms. This introduces a ~100ms window where a client can marginally exceed the limit across nodes, but reduces Redis round-trips by 100x. Twitter's rate limiting system uses this approach.
Rate Limit Headers
Well-designed rate limiting surfaces the current state to clients:
- X-RateLimit-Limit: Maximum requests allowed in the window
- X-RateLimit-Remaining: Requests remaining in the current window
- X-RateLimit-Reset: Unix timestamp when the window resets
- Retry-After: Seconds until the client can retry (on 429 responses)
These headers let clients back off gracefully rather than retrying immediately and worsening the load.
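A small sketch of assembling these headers for a response (the helper name is illustrative; only Retry-After is conditional, since it applies to 429s):

```python
def rate_limit_headers(limit, remaining, reset_ts, retry_after=None):
    """Build the rate-limit response headers described above."""
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_ts),
    }
    if retry_after is not None:
        headers["Retry-After"] = str(retry_after)  # only on 429 responses
    return headers
```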
Interview Tip
Interviewers ask this question to test two things: whether you know more than one algorithm, and whether you can reason about distributed systems. Knowing all five algorithms is table stakes. The real answer is the distributed implementation: use Redis with a Lua script for atomicity, explain why a naive INCR without Lua creates a race condition, and mention the local-bucket-with-sync pattern for high-throughput scenarios. If you can quote the sliding window counter approximation formula and say "Cloudflare found sub-1% error rate," you have signalled production-level thinking.