System: Rate Limiter
Reading time: ~20 minutes · Prerequisites: Intervals, Sliding Window, Arrays & Hashing · Capstone: rate-limiter design problem (parikshan systems track)
The product behind the system
Cloudflare's edge fleet absorbs and decides on roughly 50 million HTTP requests per second on a normal day. Every one of those requests passes through a rate-limit decision before it sees an origin server. The decision must be made in single-digit microseconds on commodity hardware, must survive a rule update applied to thousands of edge POPs simultaneously, and must be correct enough that an attacker cannot exploit a counter desync to slip past the threshold. That is the constraint envelope you are designing against.
Stripe enforces a rate limit on its API: 100 read requests per second per account on the live keys, with burst allowances and per-endpoint overrides. The limit is enforced server-side in Stripe's API gateway, returns 429 Too Many Requests with a Retry-After header, and is correct to within a few requests across a fleet of API servers spread across multiple regions. The classic Stripe engineering post on rate limiting (2017) is the most cited industrial reference on the topic; it explicitly describes a token-bucket implementation backed by Redis.
GitHub's REST API limits unauthenticated callers to 60 requests per hour per source IP and authenticated callers to 5,000 per hour per token. The response headers X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset are part of the API contract. Discord, Twilio, Shopify, and AWS API Gateway all expose the same conceptual headers; the algorithm behind them varies.
This primer is about how the architect picks between four mainstream algorithms (token bucket, leaky bucket, sliding window log, sliding window counter), where each one fails, and how to build the system at the scale Cloudflare actually runs.
What the requirements actually are
Functional requirements:
- Given a request with a key (account ID, IP, API token, user ID), decide allow vs deny within a budget.
- Return remaining capacity and reset time on every response, accurately.
- Support per-key, per-endpoint, per-tier limits; a free user, a paid user, and an enterprise user share infrastructure but see different ceilings.
- Allow operators to update limits at runtime without redeploy.
- Support burst tolerance: real traffic is bursty; a strict per-second clamp is hostile to legitimate clients.
Non-functional requirements:
- Throughput: 50M decisions per second globally, 100K per second per edge POP.
- Latency: p99 of the rate-limit decision under 1 ms; p999 under 5 ms. The limiter is on the hot path of every request.
- Durability: counters are fine to lose on a node restart for in-memory limits, but billing-tier limits must survive failover (you do not want a Redis crash to give every paid customer free traffic for a minute).
- Consistency: weakly consistent across regions is acceptable for soft limits (a 1,000 rps user might briefly do 1,050 rps if their traffic shifts regions). For hard limits (anti-abuse, anti-credential-stuffing) you need stronger consistency on a small hot-path counter.
- Cost ceiling: a rate-limit decision must cost less than serving the request it gates. If the limiter costs more than the origin, it is a denial of service tool against yourself.
The architect's framing
A rate limiter at scale decomposes into five components:
+-----------------+
request ---> | edge filter |--- allow --> origin
| (Worker / NGX) |
+--------+--------+
|
v
+-----------------+
| local counter | (in-memory, per-POP)
| (token bucket) |
+--------+--------+
|
async v
+-----------------+
| regional store | (Redis / Memcached)
| (shared truth) |
+--------+--------+
|
async v
+-----------------+ +-----------------+
| global control |<------->| config plane |
| plane (DB) | | (rule updates) |
+-----------------+ +-----------------+
Two control loops:
- Hot path: edge filter consults the local counter; the local counter consults the regional store on counter miss or sync interval. Origin is reached only if the decision is allow.
- Slow path: config plane pushes new rules to all edges; regional stores reconcile drifted counters; metrics flow up for dashboards and anomaly detection.
The algorithm choice (token bucket vs leaky bucket vs sliding window) belongs in the local-counter box. Everything else is the same regardless of which one you pick. That separation is the architect's superpower: most engineers debate the algorithm in isolation; the architect names the box it lives in and the contract it satisfies.
The trade-offs we will name
1. Token bucket vs leaky bucket.
- Choice: token bucket for an API rate limiter.
- Alternative considered: leaky bucket (queue-based, smooths output to a constant rate).
- Why token bucket wins: APIs receive bursty traffic and clients expect bursts to be allowed. A user who hits 10 requests in 50 ms after 5 seconds of silence is legitimate (a page load, a batch upload). Token bucket allows that burst because tokens accumulate; leaky bucket forces a flat output rate and would queue or reject the burst. Token bucket is also branch-free in steady state; leaky bucket needs a timer per bucket.
- When leaky bucket would win: traffic shaping in network equipment (you want a strict output rate to a downstream that cannot handle bursts). Cloudflare uses leaky-bucket logic in some downstream-protection paths for exactly this reason.
2. Sliding window log vs sliding window counter.
- Choice: sliding window counter for high-volume keys.
- Alternative considered: sliding window log (store a timestamp for every request, count how many are inside the window).
- Why sliding window counter wins: log storage grows with throughput. A 50K rps key with a 60-second window stores 3 million timestamps per key. Multiply by a million keys and the memory footprint is catastrophic. The counter approximation stores two buckets per key (previous and current window) and interpolates; memory is O(keys), not O(keys × throughput).
- When the log would win: forensic / billing use cases where you must know the exact request times after the fact (Stripe billing, AWS metered APIs). There the log is the source of truth and the counter is a fast approximation.
3. Centralised counter vs distributed counter.
- Choice: distributed counter with periodic sync (Cloudflare's approach).
- Alternative considered: centralised counter in a single Redis cluster (Stripe's approach).
- Why distributed wins at Cloudflare scale: a single Redis cannot serve 50M decisions per second, and global round-trip latency for the limit decision (US east → EU west) would dominate the p99 budget. Each POP keeps a local counter; counters reconcile asynchronously. The penalty is bounded overshoot: a 1,000 rps user might do 1,050 rps for a window while POPs are out of sync.
- When centralised wins: Stripe's volume (millions of decisions per second, not billions) fits in a sharded Redis cluster. The consistency benefit (exact counts, no overshoot) is worth the latency penalty (1-2 ms per decision). Stripe explicitly chose this.
4. Per-key precision vs aggregate sampling.
- Choice: per-key counters for high-value keys, sampled aggregates for the long tail.
- Alternative considered: per-key counters for everything.
- Why hybrid wins: 99% of traffic comes from 1% of keys (the Pareto distribution is brutal in API traffic). Per-key counters for every IP that ever hit you is a memory crisis; per-key for the heavy hitters and Count-Min Sketch for the tail is the production answer. Count-Min Sketch over-approximates by design; if it says "this key is over the limit", you check the precise counter; if it says "under", you trust it.
- When pure per-key wins: small fleets, low-throughput services, or anti-abuse limits where false negatives are unacceptable (you cannot under-count credential-stuffing attempts).
5. Synchronous deny vs asynchronous deny.
- Choice: synchronous deny (return 429 immediately).
- Alternative considered: asynchronous deny (queue the request and respond later with a 429 if it would have been rejected).
- Why synchronous wins: queueing rejected work is doing work for the attacker. The attacker wants you to spend CPU on their requests; the queue lets them. Synchronous deny in the edge filter is the cheapest possible response: a static 429 with a
Retry-Afterheader, no database touch, no origin hit. This is exactly the "early reject" rule from the global engineering standards: ~200 bytes of response, microseconds of CPU.
Where the algorithms from this bank actually appear
The token bucket is the canonical sliding-window problem in disguise. The bucket holds min(capacity, tokens + (now - last_refill) * rate) tokens at any instant. That formula is a closed-form sliding window with a single accumulator instead of a queue of events. When you understand the trick in minimum-size-subarray-sum, the trick where a single running sum replaces an inner loop, you understand why token bucket is fast: it collapses an O(n) window scan to O(1) arithmetic.
For the sliding window log, the data structure that makes it tractable is the monotonic deque from sliding-window-maximum. You push the current request timestamp to the back, pop expired timestamps from the front, and the deque length is the current request count. The discipline that primer teaches (maintaining order while you push and pop) is the same one that makes the rate-limit log correct.
For per-key state lookup, the Arrays & Hashing primer is the foundation. You will hash on (account_id, endpoint, window) to find the bucket for this request. Stripe's blog post explicitly mentions a Redis sorted set keyed this way; the hash function and key design is where most homegrown rate limiters bottleneck.
For enforcing the rate per peak concurrent requests (a different kind of limiter, e.g. AWS Lambda's concurrent-execution cap), the events sweep from merge-intervals is the model: each request is an [start, end] interval; the concurrent count is the running event count. This is the bridge between "rate per second" and "concurrent in-flight"; they are different problems that look superficially similar.
Sketch implementation
A single-node token bucket in production-quality Python:
import time
from threading import Lock
class TokenBucket:
__slots__ = ('capacity', 'rate', 'tokens', 'last', 'lock')
def __init__(self, capacity: float, rate: float):
self.capacity = capacity # max tokens (burst size)
self.rate = rate # tokens added per second
self.tokens = capacity # start full
self.last = time.monotonic()
self.lock = Lock() # one bucket per key; lock is per-bucket
def allow(self, cost: float = 1.0) -> bool:
with self.lock:
now = time.monotonic()
elapsed = now - self.last
self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
self.last = now
if self.tokens >= cost:
self.tokens -= cost
return True
return False
For the distributed case, replace self.tokens with a Redis Lua script that does the same arithmetic atomically:
-- KEYS[1] = bucket key; ARGV = capacity, rate, now, cost
local b = redis.call('HMGET', KEYS[1], 'tokens', 'last')
local tokens = tonumber(b[1]) or tonumber(ARGV[1])
local last = tonumber(b[2]) or tonumber(ARGV[3])
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
tokens = math.min(capacity, tokens + (now - last) * rate)
local allowed = tokens >= cost
if allowed then tokens = tokens - cost end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'last', now)
redis.call('EXPIRE', KEYS[1], 3600)
return {allowed and 1 or 0, tokens}
Two invariants the Lua script preserves that a naïve GET / SET cannot: (1) the refill computation and the deduction happen in the same atomic step; (2) the last timestamp is always advanced even on a deny, so the refill rate is honoured. A common bug in homegrown limiters is to skip the timestamp update on deny, which gives the user free tokens on retry.
What breaks at scale
10K req/s, single region: a single Redis instance handles everything. Latency is dominated by the network hop to Redis (~1 ms). The architect's worry here is correct key design and TTL strategy, not algorithmic.
100K req/s, single region: Redis pipelining becomes essential. A naïve one-command-per-request pattern saturates the Redis CPU. Pipeline by client, batch 10-50 commands per round trip, p99 stays under 2 ms. Sharding by key prefix begins (typically 4-16 Redis shards).
1M req/s, multi-region: Centralised Redis no longer works. Cross-region traffic blows the latency budget. Each region runs its own Redis cluster and reconciles asynchronously. Bounded overshoot enters the design vocabulary; the architect now has to explain to product that a "1000 rps limit" means "1000 rps on average, up to 1050 rps for 100 ms during regional rebalance". This is a product conversation, not an engineering one.
50M req/s, planet-scale: Cloudflare territory. The per-POP local counter is the source of truth for most decisions; the regional store is consulted only on counter miss or sync interval. Count-Min Sketch handles the long-tail keys. Network telemetry feeds back into rule tuning. The hot path is branch-predicted, lock-free, and SIMD-vectorisable; the algorithm is implemented in Rust or C++ and embedded in the edge proxy (e.g. Cloudflare's quiche and pingora). At this scale the architect cares about the CPU cycles per decision, not the algorithm.
Named failure modes:
- Counter desync: regional counters drift; a sophisticated attacker bursts across regions to exploit the lag. Mitigation: per-region budgets that sum to the global budget, not a global counter shared across regions.
- Hot key: one viral account triggers all limits everywhere. Mitigation: sharded counters for VIP keys (split a single key into 64 sub-counters, sum on read).
- Stampede on reset: every key resets at the top of the second; the next-second's traffic is 50× heavier than the previous. Mitigation: jittered windows (each key has its own randomised window offset).
- Limit-update lag: an operator updates a rule; some edges still enforce the old limit for minutes. Mitigation: versioned rules with a TTL on the rule cache, and a kill switch that bypasses the cache.
What an interview / staff-engineer review will ask
Q: Why token bucket and not leaky bucket for the API rate limiter? A: APIs need burst tolerance and clients expect it. Token bucket allows bursts up to capacity; leaky bucket forces a steady output rate, which is wrong for the API use case but right for downstream traffic shaping. If a staff engineer hears "leaky bucket" for an API limiter without justification, that is a red flag.
Q: How do you handle the multi-region case where a user's traffic shifts between regions? A: Either accept bounded overshoot (each region enforces a per-region share of the global budget) or pay the latency cost of a globally consistent counter. The product conversation is which one matters: a 5% overshoot on a soft limit is fine; a 5% overshoot on a credential-stuffing block is a security incident. The choice depends on what the limit protects.
Q: How do you stop the limiter itself from becoming the bottleneck? A: Three levers: (1) put the decision in-process at the edge, not over the network; (2) keep counters in memory, sync asynchronously; (3) reject early, since a 429 should cost less than serving the underlying request. If your limiter does a DB lookup for every request, you have built a DoS amplifier, not a limiter.
Q: How do you protect against an attacker rotating keys to evade per-key limits? A: Layered limits: per-key, per-IP, per-IP-prefix, per-ASN, per-geo. An attacker rotating 10,000 keys will share an ASN or geographic origin; the higher-aggregate limit catches them. Anomaly detection on rate-limit-rejection patterns also feeds into the WAF. Pure per-key is not enough by itself.
Q: How would you build this if you could not use Redis? A: Local in-process token buckets per node, gossip-protocol counter synchronisation between nodes (Cassandra-style anti-entropy), and a slow-path store for billing-tier limits in a SQL database. This is closer to how Cloudflare actually built it; Redis is the easy answer, not the only one. The staff engineer wants to hear that you have other tools in the toolbox.
In the AI-integrated workspace
AI-generated rate limiters typically fail in three predictable ways:
The first failure: the agent will produce a counter that increments on the request rather than decrementing a token bucket. Counter limiters are correct for fixed windows but suffer the boundary problem (twice the limit at the boundary between two windows). The architect's job is to ask the agent for a token-bucket implementation specifically and to check that the formula is tokens + elapsed * rate, not count + 1.
The second failure: the agent will write the timestamp update inside the "allow" branch only. On a deny, last is not updated; the next call gets a huge elapsed and refills the bucket. The user gets free tokens on every rejected request. Always read the Lua script and check that last is updated unconditionally.
The third failure: the agent will use Redis INCR with EXPIRE and call it a sliding window. This is a fixed window with reset, which is the worst rate limiter (boundary problem visible at every window edge). If you see INCR / EXPIRE for a sliding-window claim, the implementation is wrong.
What an architect audits personally:
- The Lua script (or equivalent atomic operation) line-by-line. The atomicity boundary is where every bug hides.
- The key design: what is the cardinality of the key? A key keyed on
request_idwill have one bucket per request and the limiter will do nothing. This sounds dumb but happens. - The default action on Redis unreachable: should the limiter fail open (allow everything) or fail closed (deny everything)? For an API gateway, fail open is correct (you do not want a Redis outage to take down all customer integrations). For an anti-abuse limiter, fail closed is correct (you do not want a Redis outage to let credential stuffers through). The AI will guess; the architect decides.
What an architect refuses from AI:
- Any limiter where the decision involves a hot SQL query. The right primitive is in-memory plus a Redis-like store. SQL on the hot path is a bug, regardless of how the agent justifies it.
- Anything that stores per-request timestamps in a list, indefinitely, "for accuracy". That is a memory leak and an OOM crash in production.
- Anything that uses wall-clock time (
time.time()) where monotonic time was required. Wall-clock can move backwards (NTP adjust, leap second); the bucket'slastbecomes a time in the future and the bucket never refills again. Alwaystime.monotonic().
Variants and adjacent systems
- Concurrency limiter (semaphore): limits in-flight requests rather than rate. Used by AWS Lambda's concurrent-execution cap, by database connection pools, by the bulkhead pattern in service meshes.
- Adaptive rate limiter: the limit shifts based on backend health. If latency rises, the limit tightens automatically. Netflix's concurrency-limits library is the canonical implementation.
- Quota system: per-day or per-month limits rather than per-second. Stripe's API has both; GitHub's hourly limit is a quota, not a rate. The data structure is the same; the window is different.
- Distributed lock: not a rate limiter, but builds on the same atomic Redis primitives. Confusingly similar code; clarify which one you mean in interviews.
- Circuit breaker: protects the caller of a degraded service; rate limiter protects the callee. Pair them: a downstream rate-limited returns 429, the upstream's circuit breaker opens, and the load shed propagates correctly.
- Backpressure (reactive streams, gRPC flow control): a protocol-level rate limit baked into the transport. The TCP slow-start is the original backpressure mechanism; everything since is a refinement.