System: Distributed Cache

Reading time: ~20 minutes · Prerequisites: Linked List, Arrays & Hashing, Math, Matrix & Heaps · Capstone: lru-cache

The product behind the system

Redis powers cache layers at GitHub, Twitter/X, Stack Overflow, Snap, and tens of thousands of production deployments. The same in-memory key-value engine handles Reddit's session store, Pinterest's feed cache, and the rate-limit counters in front of every Stripe API endpoint. Salvatore Sanfilippo's original 2009 design (single-threaded event loop, copy-on-write persistence, simple text protocol) has held up across three orders of magnitude of growth.

Memcached, the predecessor, still runs the bulk of Facebook's cache fleet. The Facebook paper "Scaling Memcached at Facebook" (NSDI 2013) is the most cited industrial reference on running a cache fleet at scale and is required reading for anyone designing cache infrastructure. The paper describes a deployment of tens of terabytes of cache in front of a sharded MySQL backend, serving over a billion requests per second at peak.

Cloudflare Workers KV is a globally replicated key-value store accessible from every edge POP within 100 ms of any user on Earth. It is a cache by another name: writes are eventually consistent globally; reads are local and microsecond-fast. It powers session storage and feature-flag distribution for thousands of Cloudflare customers.

Caffeine is the in-process JVM cache that displaced Google's Guava cache as the default high-performance cache in Java applications. The Caffeine paper and benchmarks are the modern reference for window-TinyLFU, an admission and eviction policy that beats classical LRU on most realistic workloads. If you run a JVM service, your in-process cache is almost certainly Caffeine.

This primer is about the design decisions behind these systems: write policies, eviction policies, sharding via consistent hashing, and what breaks when you go from one node to a hundred to ten thousand.

What the requirements actually are

Functional requirements:

Get and set keys by string identifier; common operations include GET, SET, DEL, INCR, EXPIRE.
Support TTL (time-to-live) per key. Caches without TTL are the primary cause of stale-data incidents.
Support eviction when memory is full; the policy must be choosable per workload.
Provide consistency guarantees: at minimum, single-key read-your-writes within a region; ideally a documented model for cross-region behaviour.
Handle the cache miss correctly: do not stampede the origin when the cache cold-starts.

Non-functional requirements:

Throughput: 1M operations per second per node is the bar Redis and Memcached set on commodity hardware. A 100-node cluster handles 100M ops per second.
Latency: p99 of a local cache read under 1 ms; sub-100 microseconds on the same host. Network adds the network's RTT.
Durability: caches are derived state; data loss on restart is acceptable for most caches. Persistent caches (Redis with AOF, Workers KV) trade speed for durability.
Consistency: most caches are eventually consistent. Some workloads need linearisability (rate-limit counters, session ownership); those use cache primitives with stronger guarantees (Redis WAIT, Workers Durable Objects).
Cost ceiling: cache memory is expensive (RAM, not disk). A 1 TB cache fleet costs $10K-$30K per month; design pressure is always toward smaller working sets.

The architect's framing

A distributed cache decomposes into four major components plus an orchestration layer. The architect's commitments before drawing the boxes: decide consistency model, decide write policy, decide eviction policy, decide sharding strategy. Everything else flows from those four.

                              +------------------+
   read   ---- key ----->     |  client / SDK    |
                              | (hash partition) |
                              +--------+---------+
                                       |
                          +------------+-------------+
                          v                          v
                  +--------------+            +-------------+
                  |  shard N     |            |  shard M    |   ... (1..K shards)
                  | (RAM store)  |            | (RAM store) |
                  +------+-------+            +------+------+
                         |                           |
                  on miss|                           | on miss
                         v                           v
                  +------------------------------------------+
                  |          origin (DB / service)           |
                  +-----------------+------------------------+
                                    |
                                    v (write-through OR async)
                                  cache

Decisions encoded in this picture:

Sharding sits in the client (or in a smart proxy like Envoy / Twemproxy). The cache nodes themselves know nothing about each other in the classic Memcached design; coordination is the client's job. Redis Cluster moves shard routing into the server, but the principle is the same.
Write policy is which arrow exists between origin and cache after a write: write-through, write-back, or write-around. The choice changes everything downstream.
Eviction policy decides which key leaves when memory fills. LRU is the classic; LFU, ARC, and TinyLFU beat it on many workloads.
Consistency between nodes is usually eventual; if it is not, you have built a distributed database, not a cache.

The trade-offs we will name

1. Write-through vs write-back vs write-around.

Choice: depends on the workload. Write-around for read-heavy data with low write frequency (product catalogues, user profiles); write-through for data the user expects to see immediately after writing (their own profile edits); write-back rarely, for high-throughput counter increments where durability lag is acceptable.
Alternative considered: a single policy everywhere.
Why per-workload wins: a single policy is wrong somewhere. Write-around for the user's own profile means they see stale data after their edit; write-through for high-volume counters doubles the cost of every increment.
Write-through: write hits the cache and the origin synchronously. Slower writes, fresh reads.
Write-back: write hits the cache; origin is updated async. Fast writes, durability risk on cache crash.
Write-around: writes go to origin only; cache is populated on read miss. Avoids polluting cache with one-shot writes.
The architect's discipline: name the workload, name the policy, document why.

2. LRU vs LFU vs TinyLFU.

Choice: TinyLFU (or its window variant W-TinyLFU) for general workloads, classical LRU only when memory is tight or the workload is uniformly recency-biased.
Alternative considered: classical LRU everywhere, the default since the 1960s.
Why TinyLFU wins on most workloads: LRU is blind to frequency. A one-time scan of a million items will evict the entire cache; classical LRU has no defence against this. TinyLFU keeps a sketch of recent access frequency and only admits a candidate if its frequency is at least the victim's. The classical scan-resistant variants (ARC, 2Q, clock-pro) achieve similar wins through different mechanisms. The Caffeine benchmarks show 10-30% hit-rate improvements over LRU on realistic workloads.
When LRU wins: small caches where the bookkeeping overhead of TinyLFU is not worth the hit-rate gain, and pure-recency workloads (a session cache where the most recent N sessions are always the right ones to keep).

3. Consistent hashing vs modulo hashing.

Choice: consistent hashing with virtual nodes.
Alternative considered: hash(key) mod N where N is the number of shards.
Why consistent hashing wins: when you add or remove a shard, modulo hashing remaps every key. Consistent hashing remaps only 1/N of the keys. For a 100-shard cluster losing one node, the difference is 99% of keys remapped vs 1%. The 99% remap triggers a stampede on the origin as the cache cold-starts everywhere. Caches without consistent hashing cannot be scaled in production.
When modulo wins: never, for production caches. Modulo is fine for in-process caches where the shard count is fixed (Caffeine's segment count, ConcurrentHashMap's bin count). Anything that grows or shrinks needs consistent hashing.

4. In-process cache vs out-of-process cache.

Choice: layered. A small in-process L1 cache (Caffeine in JVM, in-memory lru_cache in Python) backed by a shared out-of-process L2 cache (Redis, Memcached).
Alternative considered: only out-of-process, or only in-process.
Why layered wins: in-process is sub-microsecond; out-of-process is sub-millisecond. The fast path (L1 hit) serves 80% of traffic at the speed of a hash-map lookup; the medium path (L1 miss, L2 hit) serves another 15% at network speed; the slow path (both miss) hits the origin and back-fills both layers. Twitter, Facebook, and Netflix all run this pattern. The penalty is invalidation complexity: how do you tell every process's L1 that key X has changed? The standard answer is a pub-sub channel or a TTL short enough to make stale L1 entries tolerable.
When out-of-process only wins: stateless serverless deployments (Lambda, Workers) where instances do not live long enough for an L1 to matter.

5. Eventual consistency vs strong consistency in the cache.

Choice: eventual consistency for the cache itself; strong consistency is bolted on by the application (e.g. "read from origin if the request needs strong consistency").
Alternative considered: strongly consistent cache (linearisable reads and writes).
Why eventual wins: a strongly consistent cache is a distributed database, not a cache. It must coordinate writes across replicas, which kills the latency advantage that justified the cache in the first place. The point of a cache is to be fast and approximate; if you need it to be slow and exact, you need a database.
When strong consistency wins: critical state where staleness is unacceptable and the workload is small enough that the latency cost is acceptable (session ownership, distributed locks, leader election). For these, use a coordination primitive (etcd, ZooKeeper, Workers Durable Objects), not a cache.

Where the algorithms from this bank actually appear

The LRU cache itself is the lru-cache primer at production scale: a hash map + doubly-linked list. Every cache implementation in Redis, Memcached, Caffeine, Cloudflare Workers KV, and the Linux page cache uses this skeleton or an explicit refinement of it (LFU adds frequency counters per node; ARC adds two LRU lists with adaptive sizing). The architect who has internalised the Linked List primer's sentinel-and-pointer discipline can read any of these implementations and understand the bug surface.

The hash partitioning in the client is the Arrays & Hashing primer at distributed scale. The classical dict becomes a consistent-hash ring (a sorted array of (hash, node) tuples; the request hashes to a value and finds the next-greater node via binary search). The same primitive primer underwrites both.

The top-K hot key tracker (which keys are getting hammered? which to pin in a hot-cache tier?) is the Math, Matrix & Heaps pattern. top-k-frequent-elements is the literal template; the production version uses Count-Min Sketch instead of a hash map for the frequency estimate when the key cardinality is very large.

The scan-resistant admission policy in TinyLFU uses a sliding-window counter that is the Sliding Window primer in disguise. The frequency sketch is essentially a fixed-window count with a periodic decay; the minimum-size-subarray-sum trick of maintaining a running sum without re-scanning the window is the same trick.

Sketch implementation

A simplified in-process LRU with TTL, in the spirit of Caffeine but stripped to fundamentals:

import time
from threading import Lock

class Node:
    __slots__ = ('key', 'val', 'expires_at', 'prev', 'next')
    def __init__(self, key=None, val=None, expires_at=None):
        self.key, self.val, self.expires_at = key, val, expires_at
        self.prev = self.next = None

class LruCache:
    def __init__(self, capacity: int):
        self.cap = capacity
        self.map = {}                                    # key -> Node
        self.head = Node()                               # sentinel: MRU side
        self.tail = Node()                               # sentinel: LRU side
        self.head.next = self.tail
        self.tail.prev = self.head
        self.lock = Lock()

    def _unlink(self, node):
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        with self.lock:
            node = self.map.get(key)
            if node is None:
                return None
            if node.expires_at and time.monotonic() > node.expires_at:
                self._unlink(node)
                del self.map[key]
                return None
            self._unlink(node)
            self._push_front(node)
            return node.val

    def set(self, key, val, ttl_seconds=None):
        with self.lock:
            expires_at = (time.monotonic() + ttl_seconds) if ttl_seconds else None
            node = self.map.get(key)
            if node is not None:
                node.val = val
                node.expires_at = expires_at
                self._unlink(node)
                self._push_front(node)
                return
            node = Node(key, val, expires_at)
            self.map[key] = node
            self._push_front(node)
            if len(self.map) > self.cap:
                lru = self.tail.prev
                self._unlink(lru)
                del self.map[lru.key]

For the distributed version, the client wraps this with consistent hashing:

import bisect
import hashlib

class Cluster:
    def __init__(self, nodes, vnodes=128):
        self.ring = []                                   # sorted list of (hash, node)
        for n in nodes:
            for v in range(vnodes):
                h = self._hash(f"{n}#{v}")
                self.ring.append((h, n))
        self.ring.sort()

    def _hash(self, s):
        return int(hashlib.sha1(s.encode()).hexdigest()[:8], 16)

    def node_for(self, key):
        h = self._hash(key)
        i = bisect.bisect_left(self.ring, (h,))
        if i == len(self.ring): i = 0
        return self.ring[i][1]

Invariants:

_unlink and _push_front operate on nodes with non-null neighbours, guaranteed by the sentinels.
TTL check happens on get, not on every key; lazy expiration avoids a full scan.
The consistent-hash ring is sorted and uses binary search; lookup is O(log V*N) where V is virtual nodes per physical node and N is physical nodes.
Eviction is O(1) because the LRU node is tail.prev, found in constant time.

What the code omits but production needs: a background thread to clean up expired entries (lazy expiration alone leaves dead entries in memory), per-key concurrency to avoid the single global lock becoming the bottleneck (Caffeine uses segmented locks with hashed bucketing), TinyLFU's admission filter, network protocol parsing, and metrics export.

What breaks at scale

One node, 1 GB of cache: an in-process LRU like the sketch above is correct and sufficient. The architect's worry is hit rate and TTL strategy, not infrastructure.

One node, 100 GB of cache: Redis or Memcached on a single beefy machine. The fleet is one server; the operational model is restart-and-cold-cache-on-failure. The architect plans for a cold-cache stampede (the origin gets hammered the moment the cache restarts) by pre-warming critical keys or by request coalescing at the application layer.

Cluster of 10 nodes: consistent hashing in the client. Replication factor of 2 or 3 for high-traffic keys. Stampede protection (the "single-flight" pattern, where concurrent misses for the same key collapse into one origin call) becomes essential. Twitter and Facebook both publish techniques for this.

Cluster of 100 nodes: hot keys become a problem (one viral key gets 10× the traffic of any other). The cluster needs per-key replication (a hot key is replicated to multiple shards; reads are load-balanced; writes are broadcast). The Facebook Memcached paper introduces "regional caching" and "cold cluster warmup" for exactly this scale.

Multi-region cache, 10K nodes (Cloudflare Workers KV): the cache is geographically distributed. Writes are eventually consistent globally (typically 60 seconds for global propagation). The architect designs around the staleness budget: features that tolerate 1 minute of staleness use KV; features that need strong consistency use Durable Objects or a coordinated store.

Named failure modes:

Cache stampede: cache expires for a hot key; thousands of concurrent requests miss and hit the origin simultaneously. Mitigation: single-flight (one request per key proceeds to origin, others wait), TTL jitter (randomise TTL to prevent simultaneous expiration), or stale-while-revalidate (serve the stale value while a background refresh fetches the new one).
Hot key on a shard: one shard is at 90% CPU while others are at 10%. Mitigation: per-key replication, application-level key splitting (suffix the key with a random shard nonce for write-heavy keys, fan-out on read).
Cold cluster: a new region or a restart with no warm data. The origin is overwhelmed. Mitigation: replicate a snapshot from a warm region, or pre-warm with the top-K keys from access logs.
Memory fragmentation: long-running Redis instances fragment heap memory; allocator behaviour degrades. Mitigation: planned restarts on a rotation, jemalloc tuning, or migration to a defragmenting cache.
Network partition: a shard becomes unreachable. Mitigation: fail-fast clients with retry on the next replica, and a circuit breaker so the application does not block on the dead shard.

What an interview / staff-engineer review will ask

Q: A user updates their profile. They reload and see the old version. Where is the bug? A: Cache invalidation. Three places to check: (1) write policy: did the write update the cache, or only the origin? Write-around without explicit invalidation is the most common cause. (2) Multi-layer cache: the L2 was invalidated but the L1 in the user's web server still has the stale copy; an L1 pub-sub invalidation is needed. (3) Read-your-writes routing: the user's read landed on a replica that has not yet received the write. A common fix: route writes and immediate reads to the same shard (sticky sessions per user).

Q: How do you stop a hot key from saturating one shard? A: Detect it (top-K request counter per key), then replicate it: write to multiple shards under different physical keys (user:42:shard0, user:42:shard1, ...), and have the client pick a random replica on read. The downside is fan-out on write; the upside is per-shard load drops to 1/replicas.

Q: Your cache hit rate is 70%. How do you raise it to 90%? A: First, measure the miss type breakdown: compulsory, capacity, or conflict? Compulsory (the first request for a key) is unavoidable. Capacity (the cache is too small) is solved by adding memory or by a better eviction policy (TinyLFU > LRU). Conflict (the wrong keys are being evicted) is solved by inspecting access patterns and tuning admission. A second lever: longer TTLs where staleness is tolerable. A third: pre-warming hot keys on cache start.

Q: What is the consistency model of your cache between writes and reads? A: For a single-key, single-region cache: read-your-writes within the same client (the write hits the cache, the read after it sees the new value). Across clients: eventually consistent, bounded by TTL. Cross-region: depends on the replication strategy. Workers KV is eventually consistent with a 60-second budget; DynamoDB DAX is eventually consistent with a per-table policy. The architect should state the model explicitly to avoid surprises.

Q: How do you safely change the shard count of a running cache cluster? A: With consistent hashing, adding a shard moves 1/N of keys to the new shard on next access (they miss, hit origin, and back-fill). Without consistent hashing (hash mod N), the entire cache is invalidated and the origin is hammered. With consistent hashing plus replication, you can pre-fill the new shard from its peers before traffic moves to it (Redis Cluster's MIGRATE command does this). Operational discipline: change shard counts during off-peak, monitor origin load, abort if it spikes.

In the AI-integrated workspace

AI is reasonably good at writing the mechanics of caches (the LRU sentinel discipline, the consistent-hash ring) and surprisingly bad at the architectural decisions (which write policy, when to use replication, what consistency contract to publish).

What the architect uses AI for:

Boilerplate cache clients: connection pooling, retry logic with exponential backoff and jitter, serialisation/deserialisation wrappers. AI is a multiplier here.
Stress tests: AI is good at generating synthetic load patterns (Zipf distribution of keys, bursty write patterns) for benchmark suites.
Migration scripts: moving a cache from Memcached protocol to Redis protocol, or from a single-shard setup to a sharded cluster, is largely mechanical and AI handles it well with a clear spec.

What the architect audits personally:

The invalidation strategy. Cache invalidation is famously the second of the two hard problems in computer science. AI defaults to "set a TTL"; the architect demands a clearer plan: write-through, write-around with explicit invalidation, or a pub-sub bus carrying invalidation events. Get this wrong and stale data is everywhere.
The stampede protection. AI rarely writes single-flight by default; without it, a viral key takes down the origin. The architect checks for the locking discipline that ensures only one concurrent miss per key reaches the origin.
The TTL choice. AI defaults to "1 hour"; the right TTL is workload-specific. The architect demands a justification per key class (catalogue: 24h; user session: 30m; rate-limit counter: window-aligned).
The hot-key handling. AI will not write replication logic unless asked; the architect asks at design time, not at incident time.

What the architect refuses from AI:

Any "cache" that does a database query on miss without single-flight protection. That is a denial of service primitive.
Any cache writes that happen before the origin write succeeds (the "fail-open cache" pattern is correct only with write-around; write-through writes the origin first).
Any consistent-hash implementation that uses hash mod N and calls it consistent hashing. The architect grep-reviews for this; it is a common AI mistake.
Any cache that stores PII without encryption. Cache memory is durable enough (in disk-backed Redis) to be a regulatory boundary; the same encryption discipline as the primary store applies.

Variants and adjacent systems

CDN edge cache (Cloudflare, Akamai, Fastly, CloudFront): a distributed cache for HTTP responses, geographically replicated. The eviction is LRU-derived; the keying is the URL + cache headers; the protocol is HTTP. Same primitives, different layer.
Browser HTTP cache: a single-process cache in every browser. Same LRU + TTL skeleton; the protocol is HTTP cache-control headers; the eviction is per-origin.
CPU L1/L2/L3 caches: the same LRU pattern in hardware (pseudo-LRU, NRU). The cache line is 64 bytes; the eviction is a hardware FSM; the principle is identical.
OS page cache (Linux, Windows): an LRU-derived cache of file blocks in RAM. Linux uses two LRU lists (active, inactive) and a writeback thread; the architecture is documented in Documentation/admin-guide/sysctl/vm.rst.
Database buffer pool: MySQL InnoDB and Postgres both run an LRU over disk pages with a "young/old" sublist split to avoid scan-pollution. The structure is the same LRU; the integration with WAL and checkpointing is what makes it complex.
Materialised view (Snowflake, BigQuery, Postgres MATERIALIZED VIEW): a pre-computed cache of query results, refreshed by policy. Same write-through / write-around debate, longer TTLs, query-language interface instead of key-value.
Distributed key-value store (DynamoDB, Cassandra): adjacent territory. A cache without durability becomes a KV store with durability; the LRU eviction becomes a disk-backed engine; the consistent hashing becomes a partition strategy. The architectural genealogy is direct.