System: Real-Time Chat

Reading time: ~20 minutes · Prerequisites: Graphs, Intervals, Linked List · Capstone: real-time chat design problem (parikshan systems track)

The product behind the system

Discord handles ~15 million concurrent users on a normal day; that number is documented in the engineering blog. Each user maintains a long-lived WebSocket (technically the Discord Gateway, a WebSocket carrying a custom protocol) and receives presence updates, message deliveries, and voice signalling over that single connection. The published architecture has Discord running on Elixir/Erlang for the gateway tier, taking advantage of the BEAM VM's process model where a million lightweight processes per node is normal. Cassandra (then ScyllaDB, then a homegrown layer) holds the message store.

Slack runs on a similar shape: per-user WebSocket from browser to edge, fan-out from an Envoy-based mesh, message persistence in MySQL with caching layers, and search in a separate Solr/Elasticsearch fleet. The 2019 paper-equivalent talks ("Real World Slack" at QCon) document the team's transition from Hooks to Block Kit and the messaging stack that supported it.

WhatsApp at acquisition time (Facebook, 2014) ran on 35 engineers serving 450 million users. The architecture was Erlang on FreeBSD, XMPP-derived protocol, MySQL for state, and a famously minimal feature set. WhatsApp continues to publish operational details; "How WhatsApp scaled" is required reading.

Signal, Telegram, and Microsoft Teams round out the reference list. The core problem is identical: deliver a message from sender to recipient with low latency, at-least-once delivery, ordering preserved within a conversation, and history queryable on demand.

This primer is about how that stack actually fits together: how WebSockets fan out at scale, how messages are ordered when senders are in different regions, how to shard such that adding a node does not break anyone's ongoing conversation, and what breaks when concurrent connections cross 10 million.

What the requirements actually are

Functional requirements:

A user sends a message; recipients in the same conversation receive it within ~100 ms in the same region, ~500 ms cross-region.
Messages are durably stored and retrievable by conversation, sorted oldest-first, with pagination.
Ordering is preserved within a conversation (a reply to my own message arrives after my message); cross-conversation ordering is not promised.
Delivery receipts ("sent", "delivered", "read") and presence ("online", "typing") flow through the same transport.
Group conversations with up to ~10,000 members must work without per-member message fan-out becoming the bottleneck.

Non-functional requirements:

Throughput: 10M concurrent connections per region at peak; 100K messages per second sustained, 1M during a viral event.
Latency: end-to-end message delivery p95 under 100 ms within a region; p99 under 500 ms cross-region.
Durability: every accepted message is persisted before the sender's ack returns. Loss after ack is a P0 bug.
Consistency: per-conversation total order. Eventual consistency across conversations is fine.
Cost ceiling: WebSocket cost dominates (a persistent connection is several KB of kernel memory). Cost per concurrent user per month is the watched metric.

The architect's framing

A chat system decomposes into seven components arranged in two planes: a connection plane (long-lived sockets) and a data plane (storage and fan-out).

                              connection plane
                              ----------------
       +----------+           +-----------------+
       |  client  | <-- WS -->|  gateway tier   |
       +----------+           |  (per-user      |
                              |   socket state) |
                              +--------+--------+
                                       |
                                       v
                              +-----------------+
                              |  routing /      |
                              |  presence svc   |
                              +--------+--------+
                                       |
                                  pub-sub bus  (Kafka / NATS / custom)
                                       |
       data plane                      |
       ----------                      v
                              +-----------------+
                              |  message store  |
                              |  (Cassandra /   |
                              |   ScyllaDB)     |
                              +--------+--------+
                                       |
                          +------------+-------------+
                          v                          v
                  +---------------+         +-----------------+
                  |  search index |         |  push notify    |
                  |  (Elastic)    |         |  (APNs / FCM)   |
                  +---------------+         +-----------------+

The architect's commitments:

WebSocket is the transport; HTTP polling is the fallback. Modern clients can hold a WebSocket for hours; mobile clients with backgrounded apps fall back to push notifications via APNs (iOS) and FCM (Android).
State is sharded by conversation, not by user. A conversation is the natural unit of ordering and fan-out. A user with 100 conversations talks to 100 partitions; a partition handling 1,000 conversations talks to ~1,000 users.
The gateway tier is stateful (it knows which users are connected to which gateway nodes). The data plane is the source of truth for messages; the gateway is the source of truth for "who is online right now".
The pub-sub bus is the integration point. Every message published to the bus reaches every interested gateway, which forwards to the connected user. This is where the Graphs BFS pattern enters production.

The trade-offs we will name

1. WebSocket vs long-polling vs Server-Sent Events.

Choice: WebSocket as primary, HTTP/2 streams as a modern alternative, long-polling as the universal fallback.
Alternative considered: pure SSE (server-to-client only, but no client-to-server reuse).
Why WebSocket wins: full-duplex over a single TCP connection. Slack, Discord, Microsoft Teams, and WhatsApp all use WebSocket (or a binary protocol over a WebSocket-style upgrade). The browser support is universal; the protocol overhead is low.
When SSE wins: read-only feeds (live scores, stock tickers) where the client never sends back. Twitter's live event stream uses SSE.
When long-polling wins: networks that strip WebSocket frames (some corporate proxies). Slack's fallback path is HTTP long-polling, transparently engaged when WebSocket fails.

2. Fan-out at write vs fan-out at read.

Choice: fan-out at write for direct messages and small groups; fan-out at read for very large groups and channels.
Alternative considered: pure fan-out-at-write everywhere.
Why hybrid wins: a 10,000-member channel writing to every member's inbox is 10,000 writes per message. At 100 messages per second per channel, that is 1M writes per second per channel. Storing messages at the channel (one write) and having each recipient pull on connection (or on filter match) is dramatically cheaper. Discord and Slack both use channel-storage for public channels and inbox-storage for DMs.
The mirror of the feed-ranking.md celebrity problem: Twitter's celebrity tweet is the chat-system equivalent of a viral channel message. The same fan-out hybrid solves both.

3. Strict per-conversation ordering vs causal ordering.

Choice: per-conversation total order assigned by the conversation's home shard.
Alternative considered: hybrid logical clocks (Lamport / vector clocks) giving causal ordering across conversations.
Why per-conversation wins: causal ordering is mathematically more correct but expensive to compute and confusing to display. Users care about "my reply appeared after the message I replied to in this conversation", not about the global causal order of everything. Per-conversation ordering is achievable by routing all writes for a conversation to one shard (the home shard), which assigns a monotonic sequence number.
When causal ordering wins: collaborative editing (Google Docs, Figma) where multiple writers race on the same document. Real-time chat is allowed to be simpler because conversations are linear.

4. At-most-once vs at-least-once delivery.

Choice: at-least-once with message-ID de-duplication on the client.
Alternative considered: at-most-once (best effort; missed messages are gone).
Why at-least-once wins: a missed message in a chat is a product failure. Users will reach for the app and reload. At-least-once means the gateway resends until the client acknowledges; the client de-duplicates by message ID. The same exactly-once-as-illusion pattern as payment-ledger.md, with much lower stakes per duplicate.
When at-most-once wins: typing indicators, presence updates ("now typing", "now online"). These are explicitly best-effort because a lost notification is invisible; resending stale "now typing" after the user stopped typing is worse than dropping it.

5. Connection-affinity routing vs anycast routing.

Choice: connection-affinity. Once a user's WebSocket lands on gateway node X, all subsequent traffic for that user goes through X.
Alternative considered: anycast / stateless routing.
Why affinity wins: the gateway maintains per-connection state (auth context, in-progress messages, presence). Migrating that state on every packet is expensive; pinning the connection to one node is essentially free. Slack, Discord, and Teams all do this; load balancers route via consistent hashing on the user ID.
When anycast wins: read-only fan-out paths where any node can serve the request. The publication side of the pub-sub bus is anycast; the consumer side is affinity.

Where the algorithms from this bank actually appear

Fan-out delivery is graph traversal. A message in a group conversation is a node; the edge to each recipient gateway is an "deliver" operation. The Graphs primer's BFS is the literal pattern: enqueue the message, dequeue and forward to each recipient, repeat. number-of-islands is the muscle memory for BFS at scale; the production version replaces grid neighbours with "users in this conversation" and the queue with a Kafka topic.

Presence aggregation ("how many of my friends are online?") is the union-find / connected-components flavour of Graphs. number-of-islands and clone-graph build the foundation; the production version maintains a presence-graph with edge updates as users connect and disconnect.

Time-window aggregations for "did this conversation receive a message in the last 5 minutes?" are the Intervals pattern. Per-conversation activity windows are intervals on the time axis; counting active conversations in a region is the resource-counting sweep from meeting-rooms. At scale, this becomes the input to auto-scaling decisions: shrink gateway capacity in idle hours, expand for peak.

Message ordering within a connection uses a sliding-window protocol straight out of the TCP playbook. The Linked List primer's pointer discipline is what makes the implementation tractable: maintain a linked list of un-acknowledged messages, advance the head on ack, retransmit from the head on timeout. The same skeleton that powers middle-of-linked-list and reverse-linked-list powers a robust reliable-delivery protocol.

Message-search indexing (find every message I sent to Alice last Tuesday) uses inverted indexes and is adjacent to Arrays & Hashing. The inverted index is term -> sorted_list_of_doc_ids; query is the intersection of those lists, which is a Two Pointers merge. Slack's search runs on Elasticsearch, but the primitives are the same.

Sketch implementation

The gateway's per-connection write loop, the heart of the delivery path:

import asyncio
from collections import deque

class Connection:
    """One WebSocket connection per user device.
    Maintains a deque of in-flight messages awaiting ack."""

    def __init__(self, user_id, ws, conversations):
        self.user_id = user_id
        self.ws = ws                                # websocket handle
        self.conversations = conversations          # set of conversation_ids
        self.outbox = deque()                       # [(seq, msg)] in order
        self.last_acked_seq = 0
        self.next_seq = 1
        self.lock = asyncio.Lock()

    async def deliver(self, conversation_id, msg):
        # Called when the pub-sub bus has a message for one of our conversations.
        if conversation_id not in self.conversations:
            return
        async with self.lock:
            seq = self.next_seq
            self.next_seq += 1
            payload = {'seq': seq, 'cid': conversation_id, 'msg': msg}
            self.outbox.append((seq, payload))
        await self.ws.send_json(payload)

    async def on_ack(self, seq):
        async with self.lock:
            while self.outbox and self.outbox[0][0] <= seq:
                self.outbox.popleft()
            self.last_acked_seq = max(self.last_acked_seq, seq)

    async def retransmit_loop(self, interval=1.0):
        while True:
            await asyncio.sleep(interval)
            async with self.lock:
                pending = list(self.outbox)
            for (seq, payload) in pending:
                if seq > self.last_acked_seq:
                    try:
                        await self.ws.send_json(payload)
                    except ConnectionError:
                        return

The pub-sub consumer that dispatches incoming messages to the right Connections:

async def consumer_loop(bus, gateway_registry):
    """One per gateway node. Receives messages for any conversation that
    has at least one connected user on this node."""
    async for event in bus.subscribe(gateway_registry.conversation_ids()):
        cid = event['conversation_id']
        users = gateway_registry.users_in(cid)
        # parallel delivery to all local connections for this conversation
        await asyncio.gather(*[
            gateway_registry.connection(u).deliver(cid, event['msg'])
            for u in users
        ])

Invariants the code preserves:

Per-connection sequence numbers are monotonic. A client receives messages in the order the gateway assigned, never reordered.
Outbox stays sorted by sequence. Acks advance from the head; retransmissions iterate from the head.
The lock is per-connection, not global. A million concurrent connections share a million locks, never one global lock.
The pub-sub bus is the only fan-out point. Adding a new gateway node means subscribing to the relevant conversation topics; no other coordination is required.

What the code omits but production needs: backpressure (slow clients should not OOM the gateway with infinite outbox growth), end-to-end encryption (Signal protocol, MLS), idempotent sends (so a retry from the client does not duplicate), per-conversation rate limiting (so a misbehaving bot does not flood the channel), and graceful migration when a gateway node is drained.

What breaks at scale

1,000 concurrent users: a single Node.js or Go server with in-memory state and a Postgres backend handles this. Premature distribution is the bigger risk than insufficient distribution.

10,000 concurrent users: still one gateway node, but messages now live in a real database with a per-conversation table. Fan-out is in-process. The architect's worry is correctness of the message ordering, not throughput.

100,000 concurrent: connection counts cross what a single node can hold. Multi-node gateway with a load balancer doing sticky routing on user ID. Pub-sub bus (Redis pub-sub or NATS) for cross-node fan-out. Cassandra or ScyllaDB for the message store; per-conversation sharding.

1M concurrent (Slack / Discord regional cluster): gateway tier is 50-100 nodes. Pub-sub bus is Kafka or NATS JetStream. Connection registry (which user is on which node) is a Redis sorted set or a custom service. Push notification gateway integrated with APNs/FCM for mobile clients with backgrounded apps. Search is a separate Elasticsearch fleet, indexed asynchronously off the message log.

10M+ concurrent (WhatsApp, large Discord region): regional sharding by geographic locality. Conversations have a home region; cross-region delivery uses replication between regional pub-sub buses. Erlang or Go is the dominant language at this tier because the per-connection memory cost in JVM languages becomes painful. WhatsApp famously did 2M+ connections per FreeBSD node by tuning kernel parameters and using a custom Erlang scheduler.

Named failure modes:

Thundering herd on reconnect: a gateway node crashes; 100K clients reconnect simultaneously to surviving nodes; those nodes saturate. Mitigation: client-side jittered reconnect backoff (1-30 seconds randomised), gateway capacity reserved for reconnect surges, and connection-rate limiting at the LB.
Hot conversation: a viral channel receives 10K messages per second. The home-shard pub-sub partition saturates. Mitigation: per-conversation rate limiting (Slack visibly throttles posts in active channels), or sharding the conversation across multiple partitions with a fan-in step.
Slow consumer: one device on a flaky network cannot keep up; its outbox grows; gateway memory inflates. Mitigation: outbox size cap with disconnection on overflow (the device reconnects and replays from a checkpoint).
Pub-sub bus partition: Kafka cluster split brain leaves two halves of the gateway fleet talking to different views of the message stream. Mitigation: single-leader-per-partition discipline, with Kafka's ISR protocol enforcing this; client-visible degradation is preferable to a silent divergence.
Message ordering bug across region migration: a conversation's home shard is moved between regions; messages in flight during the move arrive out of order. Mitigation: drain-and-handoff protocol where the old shard refuses new writes once the new shard is ready; cross-region replication acknowledges the handoff explicitly.

What an interview / staff-engineer review will ask

Q: Walk me through what happens when User A sends "hello" to a 5-person group. A: A's client sends POST /messages over the WebSocket. A's gateway authenticates, generates a message ID, calls the conversation's home shard to assign a sequence number, and writes to the message store. The store-write success returns; the gateway publishes to the conversation's pub-sub topic. The pub-sub bus broadcasts to all gateway nodes subscribed to that conversation; each gateway looks up which connected users are in the group; each connection forwards the message over its WebSocket. The clients ack their receipt; the server records delivery receipts. The whole loop takes ~50-100 ms within a region.

Q: How do you handle a user who has been offline for a week? A: The user's last-seen sequence number per conversation is stored. On reconnect, the client requests "give me messages with sequence > X for conversation Y". The message store returns the deltas. If the gap is too large (the user has been offline for a month), the response is paginated and the client backfills incrementally. Push notifications via APNs/FCM are independent: the user receives a push for messages while their app was backgrounded, but the in-app history reload is the source of truth.

Q: How do typing indicators differ from messages in your architecture? A: Typing indicators are ephemeral and at-most-once: a fire-and-forget pub-sub event, never persisted, with a short TTL (3-5 seconds). They share the WebSocket but not the durable store. Presence ("online", "away") is similar but with a longer TTL (30-60 seconds) and a heartbeat mechanism so a dead client eventually decays to offline. Messages are durable; typing and presence are not.

Q: What if Kafka is down? A: Three options, in order of preference: (1) a hot standby Kafka cluster with replication, so an entire cluster failure triggers failover within seconds. (2) Direct gateway-to-gateway gossip for in-region delivery as a fallback (the message store is still the durable truth). (3) Degradation mode where the system explicitly accepts higher latency and informs users. Discord has published incident reports detailing exactly this trade-off; their answer is heavy investment in the pub-sub bus's reliability.

Q: How would you scale a Slack-style search across years of channel history? A: Asynchronous indexing into a sharded Elasticsearch fleet, keyed by workspace ID. Index is built from the message log, not from live messages, so a search outage does not slow message delivery. Query budget is separate from chat budget (~1 second is acceptable for search; chat needs 100 ms). At Slack's scale, the search infrastructure is bigger than the chat infrastructure by index size.

In the AI-integrated workspace

AI is reasonably effective at writing the WebSocket plumbing (the gateway, the outbox, the retransmit loop) and surprisingly bad at the delivery contract (what does at-least-once actually require, where does ordering break, when do you accept eventual consistency).

What the architect uses AI for:

Protocol scaffolding: the WebSocket handshake, ping/pong heartbeats, JSON message envelopes, error code definitions. AI is a multiplier.
Reconnection logic: jittered exponential backoff, replay-from-sequence on reconnect, idle-timeout handling. AI handles these patterns competently.
Test simulators: load-testing 10K simulated WebSocket clients is mechanical work; AI can produce the harness in a session.

What the architect audits personally:

Ordering invariants. AI commonly produces code where the sequence-assignment and the message-persistence are not atomic; the architect demands a single transaction (or a single Cassandra LWT) that does both.
Backpressure. AI rarely writes a bounded outbox; without one, a slow client OOMs the gateway. The architect adds the cap and the disconnection policy.
The pub-sub bus contract. AI will often use Redis pub-sub for fan-out, which has at-most-once semantics (no persistence on disconnect). The architect catches this and demands Kafka or NATS JetStream for any path where at-least-once matters.
End-to-end encryption. If the product needs E2EE (Signal, WhatsApp), AI's first draft will inevitably encrypt only the transport, not the message payload. The architect demands the double-ratchet or MLS discipline.

What the architect refuses from AI:

Any "ordering" implementation based on wall-clock timestamps. Wall-clock is not monotonic and drifts between nodes; ordering must come from a logical sequence number assigned by a single authority.
Any reconnection logic without jitter. A fleet of clients reconnecting simultaneously is a self-DDoS.
Any feature that "remembers the last 100 messages per user in memory". That is a memory leak waiting to happen with a million connections; use a TTL'd cache or fall back to the durable store on read.
Any storage of plaintext messages in logs. Chat content in observability pipelines is a privacy incident; the architect explicitly blocks this path.

Variants and adjacent systems

Real-time collaborative editing (Google Docs, Figma, Notion): same WebSocket transport, different consistency model (operational transformation or CRDTs instead of per-conversation total order).
Voice and video calls (Discord, Zoom, WhatsApp Calls): WebRTC for media plus a signalling channel that often shares the chat WebSocket. SFU (Selective Forwarding Unit) topology vs MCU vs mesh is the structural debate.
IoT message broker (MQTT, AWS IoT Core): the same pub-sub fan-out at a different transport (MQTT instead of WebSocket) and different priorities (battery efficiency, retained messages).
Live streaming chat (Twitch, YouTube Live): the chat is essentially a single large "channel" with millions of concurrent readers and tens of thousands of writers per minute. Fan-out at read with aggressive batching.
Group SMS / RCS: the carrier-layer analogue. The protocol is older and less flexible, but the architectural challenges (fan-out, delivery receipts, ordering) are the same.
Game lobby / matchmaking chat: a smaller-scale chat with tight integration into game state. Halo, League of Legends, and Fortnite all run chat infrastructure on this pattern; the integration with anti-cheat and player banning is the differentiator.