
Case: Slack's Real-Time Messaging and Channel Servers

Era: 2013 to present  ·  Author / source: Slack Engineering blog posts, including "Real-time Messaging" (Slack Engineering), "Flannel: an application-level edge cache" (2017), "Scaling Slack's Job Queue" (2017), and "How Slack Built Shared Channels"  ·  Read alongside: pub/sub fanout, WebSocket gateways, consistent hashing, durable job queues

The situation

Slack is, mechanically, a real-time fan-out problem. A message posted in a channel needs to be delivered, in order, to every client currently watching that channel, anywhere on Earth, in well under a second. Slack's published latency target is "messages across the world in 500 ms," and that includes WebSocket delivery to the receiving clients, not just a write at the API.

The shape of the problem is unusual:

  • Channel sizes are wildly uneven. Some workspace #general channels have tens of thousands of subscribers; most channels have fewer than ten.
  • The same workspace has long-lived data (messages, users) and high-churn data (typing indicators, presence). They need different durability stories.
  • Slack started single-region and grew into a global workforce tool. Round-trip-time budgets across continents are unforgiving.
  • Job throughput is huge. Slack's job queue, by 2017, processed "over 1.4 billion jobs daily at peak rates of 33,000 per second."
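
To put the quoted job-queue figures in proportion, the arithmetic below converts the daily total into an average rate and compares it with the quoted peak. It is a back-of-envelope sketch only: it assumes the 1.4 billion jobs are spread over a full 24-hour day, which the source does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

jobs_per_day = 1_400_000_000   # "over 1.4 billion jobs daily" (quoted figure)
peak_per_second = 33_000       # "peak rates of 33,000 per second" (quoted figure)

# Average rate if the daily total were spread evenly over 24 hours (assumption).
average_per_second = jobs_per_day / SECONDS_PER_DAY
peak_to_average = peak_per_second / average_per_second

print(f"average: {average_per_second:,.0f} jobs/s")
print(f"peak / average: {peak_to_average:.2f}x")
```

Under that assumption the average is roughly 16,000 jobs per second, so the quoted peak is about twice the average. A queue sized for the average would fall behind during peaks, which is the enqueue/dequeue imbalance the job-queue redesign had to absorb.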

By 2017 the original architecture was straining. The old rtm.start API call returned the entire team roster, channel list, and bot directory on every WebSocket connect. For a workspace with thousands of users, that payload was megabytes; reconnection storms after a partial outage could compound into a self-inflicted DDoS.

The options on the table

For real-time delivery and team-data hydration:

  1. Pure pub/sub through Kafka or a broker. Conceptually clean, hard to keep ordered per-channel under partition reassignment, and adds operational surface area.
  2. Polling. Simple, but it fails at Slack's scale and drains battery on mobile.
  3. Long-poll HTTP. Half a step toward WebSockets; same scaling concerns.
  4. WebSockets with stateful gateway servers, channel-affinity routing. Persistent connections, consistent-hashed channel ownership, in-memory state for hot data. Operationally heavy.
  5. Send everything through the API monolith every time. Simple, kills your latency budget instantly.

For workspace-data hydration on connect:

  1. Send full state in rtm.start. What they did originally. Breaks for large teams.
  2. Lazy load via a request/response API. Lighter connect, more queries per second.
  3. Edge cache for team data, lazy load via the cache. Lower connect time, smaller payloads, cache shared across reconnections.

For job processing:

  1. Pure Redis queues. What Slack started with. Memory-bound, vulnerable to enqueue/dequeue imbalance.
  2. Pure Kafka. Durable, ordered, but the existing application code expected Redis semantics.
  3. Kafka as durable buffer, Redis as execution queue. Two-layer system that preserves the Redis programming model while removing the memory ceiling.

What they chose, and why

For real-time messaging, Slack converged on a set of stateful Java services. The relevant ones, as Slack's own engineering writing describes them:

  • Channel Servers (CS). Stateful, in-memory. Each Channel Server owns a set of channels assigned by consistent hashing. Channel history and recent events live in memory on that server. At peak, "about 16 million channels are served per host."
  • Gateway Servers (GS). Stateful, hold user WebSocket subscriptions. Distributed across geographic regions so clients connect to a nearby gateway.
  • Admin Servers (AS). Stateless. Intermediaries between the webapp backend and the Channel Servers.
  • Presence Servers (PS). Track online/offline status.
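
Consistent-hashed channel ownership can be illustrated with a small hash ring. This is a minimal sketch, not Slack's implementation: the class name HashRing, the server names, the virtual-node count, and the choice of SHA-256 are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical ring assigning channels to Channel Servers."""

    def __init__(self, servers, vnodes=64):
        self._vnodes = vnodes
        self._points = []   # sorted ring positions
        self._owner = {}    # ring position -> server name
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server gets several virtual nodes to even out the load.
        for i in range(self._vnodes):
            point = ring_position(f"{server}#{i}")
            bisect.insort(self._points, point)
            self._owner[point] = server

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def owner_of(self, channel_id):
        # The owner is the first virtual node clockwise from the channel's hash.
        idx = bisect.bisect_right(self._points, ring_position(channel_id))
        return self._owner[self._points[idx % len(self._points)]]


ring = HashRing(["cs-1", "cs-2", "cs-3"])
channels = [f"C{i}" for i in range(500)]
before = {c: ring.owner_of(c) for c in channels}
ring.remove("cs-2")
after = {c: ring.owner_of(c) for c in channels}
moved = [c for c in channels if before[c] != after[c]]
```

The property worth noticing is that every channel in moved was previously owned by cs-2. Channels on the surviving servers keep their owner, which is what makes failover and rebalancing tractable for stateful servers.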

A new message flows as follows: client posts to webapp -> webapp writes to database -> Admin Server -> Channel Server -> all Gateway Servers with subscribers -> connected clients. Transient events such as "User typing in a channel" skip database persistence; they flow directly through Channel Servers to subscribed clients. Slack's writeup explicitly calls these out as "transient events" that do not need durable storage.
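
The split between the durable and transient paths can be sketched as a single routing decision. Everything here is illustrative: the event type names, the Router class, and the two lists standing in for the database and for Channel Server fan-out are assumptions, not Slack's code.

```python
from dataclasses import dataclass, field

# Hypothetical event type names; only the durable/transient split comes from the text.
TRANSIENT_TYPES = {"user_typing"}


@dataclass
class Event:
    type: str
    channel_id: str
    payload: dict


@dataclass
class Router:
    database: list = field(default_factory=list)  # stand-in for durable storage
    fanout: list = field(default_factory=list)    # stand-in for Channel Server delivery

    def handle(self, event: Event) -> None:
        if event.type not in TRANSIENT_TYPES:
            # Durable events are persisted before they are fanned out.
            self.database.append(event)
        # Every event, durable or transient, is delivered to subscribers.
        self.fanout.append(event)


router = Router()
router.handle(Event("message", "C1", {"text": "hello"}))
router.handle(Event("user_typing", "C1", {"user": "U1"}))
```

After these two calls the database stand-in holds one event and the fan-out stand-in holds two: the typing indicator reached subscribers without costing a database write.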

For team-data hydration, Slack built Flannel, an application-level edge cache deployed at points-of-presence. Per Slack's writeup, Flannel serves "4 million simultaneous connections" at peak and "600K client queries per second." It reduced bootstrap payload "by 7x" for a 1.5K user team and "by 44x" for a 32K user team. The architectural shift was from eager load (rtm.start returns everything) to lazy load on demand through a regional edge cache, with proactive prefetch ("When broadcasting a message mentioning a colleague, Flannel sends user data to clients lacking recent information") to avoid round-trips on common access patterns.
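
The lazy-load-plus-prefetch idea can be sketched with a shared cache and per-client sessions. This is a toy model under stated assumptions: EdgeCache, ClientSession, the frame shapes, and the backend callable are hypothetical names invented for the example, and real Flannel behavior is richer than this.

```python
class EdgeCache:
    """Hypothetical regional cache of user records, shared across client sessions."""

    def __init__(self, backend):
        self._backend = backend   # assumed: callable mapping user_id -> user record
        self._users = {}
        self.backend_calls = 0

    def get_user(self, user_id):
        if user_id not in self._users:
            # Cache miss: fetch once, then serve every later session from memory.
            self.backend_calls += 1
            self._users[user_id] = self._backend(user_id)
        return self._users[user_id]


class ClientSession:
    """Tracks which user records one connected client has already received."""

    def __init__(self, cache):
        self._cache = cache
        self.known_users = {}

    def deliver(self, message):
        """Return the frames sent to this client for one message."""
        frames = []
        for user_id in message.get("mentions", []):
            if user_id not in self.known_users:
                # Proactive prefetch: send user data ahead of the message that needs it.
                user = self._cache.get_user(user_id)
                self.known_users[user_id] = user
                frames.append({"type": "user_data", "user": user})
        frames.append({"type": "message", "text": message["text"]})
        return frames


cache = EdgeCache(backend=lambda uid: {"id": uid, "name": f"user-{uid}"})
alice, bob = ClientSession(cache), ClientSession(cache)
msg = {"text": "ping <@U7>", "mentions": ["U7"]}
first = alice.deliver(msg)
second = alice.deliver(msg)
third = bob.deliver(msg)
```

The first delivery to a client carries the user record plus the message; a repeat delivery to the same client carries only the message; and a second client is served from the shared cache, so the backend is queried once in total.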

For job processing, Slack chose the two-layer approach: a stateless Go service (Kafkagate) writes jobs into Kafka (16 brokers, 32 partitions per topic, 3x replication, 2-day retention), and a relay service (JQRelay) drains from Kafka into Redis for execution. The Redis-shaped programming model is preserved; the memory exhaustion failure mode is eliminated.
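
The two-layer shape can be modeled in memory to show why the execution queue stops being the memory ceiling. This is a sketch under assumptions, not Kafkagate or JQRelay: DurableLog stands in for a Kafka partition, ExecutionQueue for Redis, and the capacity-based backpressure in relay_once is a simplification chosen for the example.

```python
from collections import deque


class DurableLog:
    """Append-only log with a consumer offset (stand-in for a Kafka partition)."""

    def __init__(self):
        self._entries = []
        self.committed = 0   # offset of the next job to relay

    def append(self, job):
        self._entries.append(job)

    def read_from(self, offset, limit):
        return self._entries[offset:offset + limit]


class ExecutionQueue:
    """Bounded in-memory queue (stand-in for Redis)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._jobs = deque()

    def free_slots(self):
        return self.capacity - len(self._jobs)

    def push(self, job):
        self._jobs.append(job)

    def pop(self):
        return self._jobs.popleft() if self._jobs else None


def relay_once(log, queue):
    """Move as many jobs as the execution queue has room for; return the count."""
    batch = log.read_from(log.committed, queue.free_slots())
    for job in batch:
        queue.push(job)
    log.committed += len(batch)
    return len(batch)


log = DurableLog()
for n in range(10):
    log.append(f"job-{n}")          # enqueue burst lands in the durable layer
queue = ExecutionQueue(capacity=4)
moved_first = relay_once(log, queue)    # fills the queue
moved_full = relay_once(log, queue)     # queue is full, nothing moves
done = [queue.pop(), queue.pop()]       # workers drain two jobs
moved_after = relay_once(log, queue)    # relay refills the freed slots
```

A burst of enqueues lands in the log rather than in memory-bound Redis, and the relay only forwards what workers can absorb. Workers still see a plain queue, which is the sense in which the Redis programming model is preserved.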

For workspace topology, the "How Slack Built Shared Channels" writeup explains that originally "when a workspace was created, it was assigned to a specific database shard, messaging server shard, and search service shard." Shared channels broke that assumption; Slack reworked the data model so "every channel in Slack (including shared and non-shared channels) has a single channels row that lives on the originating workspace's shard," with a shared_channels table holding the cross-workspace bridge rows.
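
The reworked data model can be sketched with two record types. The field names below are hypothetical; the source only states that each channel has a single channels row on the originating workspace's shard and that a shared_channels table holds the cross-workspace bridge rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelRow:
    channel_id: str
    origin_workspace: str   # the workspace whose shard holds the single channels row


@dataclass(frozen=True)
class SharedChannelRow:
    channel_id: str
    workspace: str          # another workspace bridged to the channel


def workspaces_for(channel, shared_rows):
    """Workspaces that can see a channel: the origin plus any bridge rows."""
    members = {channel.origin_workspace}
    members.update(
        row.workspace for row in shared_rows if row.channel_id == channel.channel_id
    )
    return members


general = ChannelRow("C1", origin_workspace="T-acme")
partners = ChannelRow("C2", origin_workspace="T-acme")
bridges = [SharedChannelRow("C2", workspace="T-globex")]
```

Here workspaces_for(general, bridges) returns only the originating workspace, while workspaces_for(partners, bridges) returns both. A non-shared channel is simply a channel with no bridge rows, so both kinds share one code path.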

What they gave up

  • Operational simplicity. Stateful servers with consistent-hashed channel ownership are operationally heavier than stateless edge functions. Failover, rebalancing, and rolling restarts are nontrivial.
  • A single source of truth at runtime. Channel state lives in CS memory; the database is the durable record. Recovery on CS failure depends on replaying recent state, which is faster than going to the database but introduces its own bug surface.
  • Cross-region transactional semantics. Gateway Servers are regional; messages cross regions to reach all subscribers. Slack accepted that some properties are eventually consistent across regions.
  • Strict global ordering across channels. Per-channel ordering is preserved because all writes to a channel go through one Channel Server. Across channels, ordering is whatever the database write order says.
  • Pure-Redis simplicity in the job queue. The Kafka layer is real operational complexity, paid in exchange for durability and rate-limit headroom.
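
The ordering trade-off in the list above can be made concrete. In this sketch a single owner assigns a per-channel sequence number to each write; the sequence counter is an assumption introduced for illustration, since the source says only that per-channel order holds because all writes to a channel go through one Channel Server.

```python
from collections import defaultdict
from itertools import count


class ChannelServer:
    """Hypothetical single owner for a set of channels."""

    def __init__(self):
        # One independent counter per channel, starting at 1.
        self._next_seq = defaultdict(lambda: count(1))
        self.log = defaultdict(list)

    def publish(self, channel_id, text):
        # All writes to a channel pass through here, so they get a total order.
        seq = next(self._next_seq[channel_id])
        self.log[channel_id].append((seq, text))
        return seq


server = ChannelServer()
a1 = server.publish("C1", "first in C1")
b1 = server.publish("C2", "first in C2")
a2 = server.publish("C1", "second in C1")
```

Within C1 the sequence numbers 1 and 2 give an unambiguous order. The number 1 in C2 says nothing about where that message falls relative to C1's messages, which is the cross-channel ordering that was given up.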

How it played out

Slack scaled past hundreds of thousands of paid customer organizations and tens of millions of daily active users on this design. Slack still cites the 500 ms global delivery figure in its public engineering writing. Major user-visible incidents have been rare relative to Slack's complexity, and where they have happened (notably a 2021 Slack-wide outage), the proximate cause was infrastructure-side, not in the real-time messaging stack itself.

Flannel's reduction in connect-time payload (44x for large teams) was decisive in keeping the largest enterprise customers usable. The job queue redesign let Slack absorb workload growth without doubling Redis capacity every year.

The shared-channels redesign laid the groundwork for Slack Connect, which became a major commercial product line and required cross-workspace data that the original sharded-by-workspace model could not represent cleanly.

Where it ties to this bank's patterns

  • [[websocket-gateway-pattern]]: stateful gateway plus stateless API monolith is now a common pattern.
  • [[consistent-hashing]]: the mechanism that makes channel-affinity routing tractable.
  • [[edge-caching-application-data]]: Flannel is a canonical example of caching not just static assets but application data structures.
  • [[durable-queue-vs-execution-queue]]: the Kafka + Redis pattern is broadly applicable.
  • [[sharding-by-tenant]]: workspace-shard pattern, and what to do when the tenant model evolves (shared channels).
  • Problem links: chat product design, real-time collaboration, presence systems, multi-tenant data isolation.

What a candidate should take away

  1. Real-time at scale is consistent hashing plus stateful gateways. Pure pub/sub does not give you per-channel ordering or per-channel affinity for free.
  2. Eager load is the enemy of cold start. The shift from rtm.start to Flannel is the same lesson as serverless cold-start: send the minimum, fetch the rest on demand.
  3. Separate the durable log from the execution queue. Kafka and Redis are not competitors here; they are layers.
  4. Sharding by tenant is a great default until the tenant model changes. Slack Connect (cross-workspace channels) would not have been possible without reworking the workspace-sharded data model.
  5. Transient and durable events deserve different paths. Typing indicators do not need a database write; messages do. Treating them the same wastes both budgets.

What an AI agent would not have got right

  • An AI asked to "build a chat system" will produce a stateless API plus a single pub/sub broker. It will not propose stateful Channel Servers with consistent-hashed ownership, because that pattern is operationally expensive and AI advice favors clean stateless designs.
  • It will conflate the durable log and the execution queue. The first sketch will be either "use Redis for everything" (Slack's old failure mode) or "use Kafka for everything" (preserves nothing of the existing application contract).
  • It will treat rtm.start-style eager load as obviously correct ("send everything the client needs in one call"). The Flannel insight is that this is fine until your largest customer has 32K users.
  • It will not separate transient from durable events. Writing typing indicators to the database is the default sketch, and it is wrong.
  • It will not anticipate that tenant boundaries are a product decision, not a permanent architectural law. Workspace-sharded systems are easy to build and very hard to change once the product needs cross-tenant features.

Sources