Design: Slack's Realtime Communication (Text)
Requirements
- Multiple users, multiple channels
- Users can DM or message in channels
- Real-time chat delivery
- Historical messages can be scrolled through
Similar systems: Any realtime chat, realtime polls, creator tools, live interactions.
Data Model
Tables
users: u_id, name, ...
channels: c_id, type (dm/channel), name, ...
messages: m_id, channel_id, sender_id, text, timestamp
membership: user_id, channel_id, is_muted, is_starred, ...
Key insight: DMs are modeled as channels with exactly 2 members. One code path for everything — no separate DM table, no separate fanout logic.
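The single-code-path idea can be sketched as follows. `FakeDB` and the function names are illustrative stand-ins, not Slack's actual schema layer:

```python
import itertools

class FakeDB:
    """Tiny in-memory stand-in for the tables above (illustrative only)."""
    def __init__(self):
        self.tables = {"channels": [], "membership": [], "messages": []}
        self._ids = itertools.count(1)

    def insert(self, table, row):
        row = dict(row, id=next(self._ids))
        self.tables[table].append(row)
        return row["id"]

def create_dm(db, user_a, user_b):
    """A DM is just a channel of type 'dm' with exactly two members."""
    c_id = db.insert("channels", {"type": "dm", "name": None})
    db.insert("membership", {"user_id": user_a, "channel_id": c_id})
    db.insert("membership", {"user_id": user_b, "channel_id": c_id})
    return c_id

def send_message(db, sender_id, channel_id, text, ts):
    # Identical insert whether channel_id is a DM or a named channel:
    # one code path for everything.
    return db.insert("messages", {"channel_id": channel_id,
                                  "sender_id": sender_id,
                                  "text": text, "timestamp": ts})
```

Everything downstream (persistence, fanout, history) only ever sees a `channel_id`.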
Messages DB
- Partition key: channel_id — all messages for a channel on the same shard
- Sort key: timestamp — messages pre-sorted by time, range scans are free
- Historical message retrieval via REST: GET /ch/{channel_id}/msgs?p=1
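A sketch of why this key layout makes history scrolling cheap: within one partition the messages are already sorted by timestamp, so paging backwards is a pure range scan. The data and names here are illustrative:

```python
from bisect import bisect_left

# One shard's data: partition key (channel_id) -> messages sorted by the
# sort key (timestamp). In-memory stand-in for the real store.
shard = {
    "general": [(100, "hi"), (105, "hello"), (110, "morning"), (120, "lunch?")],
}

def history(channel_id, before_ts, limit):
    """Newest `limit` messages strictly before `before_ts`: a range scan,
    because the partition is already sorted by timestamp."""
    msgs = shard[channel_id]
    end = bisect_left(msgs, (before_ts,))
    return msgs[max(0, end - limit):end]
```

Each "load older messages" scroll is one such call with the oldest timestamp the client has seen.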
3 Patterns for Message Communication
| Pattern | How It Works | Example |
|---|---|---|
| 1. Persist first, then deliver | Write to DB → on success → fan out via WS | Slack (persistence is critical) |
| 2. Deliver first, persist async | Send via WS → async worker persists to DB | WhatsApp (local persistence on device) |
| 3. Realtime only, no persistence | Send via WS → never persisted | Zoom chat, Google Meet chat |
Slack uses pattern 1 — if you can see a message, it's guaranteed to be saved. No "ghost messages" that disappear on refresh.
Why Not HTTP Polling?
With plain HTTP, the client has to poll for new messages:
Bob's browser: GET /messages?channel=general&since=<timestamp> (every 2 seconds)
Server: "nothing new" / "here's 3 new messages"
At Slack's scale: 10M users × poll every 2 sec = 5 million requests/second, most returning empty responses. Massive waste of bandwidth and server resources.
Solution: Keep a persistent, bidirectional connection open — WebSockets.
WebSockets: The Fundamentals
A WebSocket starts as a normal HTTP request, then upgrades to a persistent TCP connection:
Client Server
| |
|--- HTTP GET /chat ---------->|
| Upgrade: websocket |
| |
|<-- HTTP 101 Switching --------|
| Protocols |
| |
|================================| ← persistent TCP connection
| full duplex: both sides |
| can send at any time |
|================================|
After the upgrade, either side can send at any time — no request/response cycle. The connection stays open for minutes, hours, or days.
What the server holds per connection
Each WS server maintains in-memory state for every connected user: a map from each open socket to the set of channels that user belongs to. This is socket membership — semi-persistent state (it lives in memory and is lost if the server restarts).
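A sketch of that in-memory state (class and method names are illustrative):

```python
class WSServer:
    """Per-server socket membership: nothing here is persisted, so a
    restart drops every entry."""
    def __init__(self):
        # socket_id -> set of channel_ids that connection cares about
        self.memberships: dict[str, set[str]] = {}

    def on_connect(self, socket_id: str, channels: set[str]) -> None:
        self.memberships[socket_id] = set(channels)

    def on_disconnect(self, socket_id: str) -> None:
        self.memberships.pop(socket_id, None)

    def sockets_for(self, channel_id: str) -> list[str]:
        """Answer: do any of my connections care about this channel?"""
        return [sid for sid, chans in self.memberships.items()
                if channel_id in chans]
```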
The Full Message Flow (Slack's Architecture)
The client maintains two separate connections:
- HTTP → API server (for sending messages, loading history)
- WebSocket → WS server (for receiving real-time pushes)
sequenceDiagram
participant Alice as Alice's Browser
participant API as API Server
participant DB as Messages DB
participant PS as Pub/Sub
participant WS1 as WS Server 1
participant WS2 as WS Server 2
participant Bob as Bob's Browser
Alice->>API: HTTP POST "hello" to #general
API->>DB: Persist message
DB-->>API: write ack
API->>PS: Publish to #general topic
PS->>WS1: Broadcast
PS->>WS2: Broadcast
WS1->>WS1: Check local connections for #general
WS2->>WS2: Check local connections for #general
WS2->>Bob: Push "hello" via WebSocket
Why two paths?
- HTTP for sending = request/response, guaranteed ack, retryable. Alice knows the message was saved.
- WebSocket for receiving = fire-and-forget push, fast. Bob sees it instantly.
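The persist-then-fanout ordering can be sketched as follows. `MemDB` and `MemPubSub` are illustrative stubs for the real stores:

```python
class MemDB:
    """Illustrative in-memory stand-in for the Messages DB."""
    def __init__(self):
        self.rows = []

    def save(self, msg):
        self.rows.append(msg)
        return len(self.rows)          # row id

class MemPubSub:
    """Illustrative stand-in for Redis Pub/Sub."""
    def __init__(self):
        self.published = []

    def publish(self, topic, msg):
        self.published.append((topic, msg))

def handle_post_message(db, pubsub, channel_id, sender_id, text):
    """API-server handler: persist first, fan out second."""
    msg = {"channel_id": channel_id, "sender_id": sender_id, "text": text}
    try:
        msg["id"] = db.save(msg)       # 1. persist; failure stops everything
    except Exception:
        return {"status": 500}         # Alice's client retries the HTTP POST
    pubsub.publish(channel_id, msg)    # 2. only a saved message is fanned out
    return {"status": 200, "id": msg["id"]}
```

Because the publish happens only after `db.save` returns, any message a WS server pushes is already durable.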
Why persist first?
If we fan out before persisting and the DB write fails:
- Bob saw the message
- Alice thinks it was sent
- But it's not in the database
- Both refresh → message is gone (ghost message)
Persist-first guarantee: if you can see it, it's saved.
Fanout: Pub/Sub Broadcast
When a message is persisted, the API server publishes to a pub/sub system (Redis Pub/Sub or similar). All WS servers receive it:
API Server: "new message in #general"
│
▼
Pub/Sub (Redis)
│
┌────┼────┐
▼ ▼ ▼
WS1 WS2 WS3 ← all receive it
│ │ │
└→ each checks local socket membership
"do any of my connections care about #general?"
if yes → push down the WebSocket
Broadcast vs targeted delivery:
- Broadcast to all — simple, no state to maintain. Wasteful for small DM channels, but for popular channels almost every WS server has a member.
- Registry lookup — efficient (only notify relevant WS servers), but requires keeping a connection registry in sync. More complexity for marginal gain at Slack's scale.
Slack uses broadcast within "pockets" (see scaling section below).
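A sketch of the broadcast side: every WS server in the pocket receives every publish and filters it against its local socket membership (names are illustrative):

```python
def on_broadcast(local_memberships, push, channel_id, msg):
    """Runs on each WS server when the pub/sub bus delivers a message.
    local_memberships: socket_id -> set of channel_ids on this server.
    push: callable that writes down one WebSocket."""
    delivered = 0
    for socket_id, channels in local_memberships.items():
        if channel_id in channels:
            push(socket_id, msg)   # only sockets that care get the push
            delivered += 1
    return delivered   # 0 is fine: the broadcast was simply wasted here
```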
Non-Delivery of Realtime Messages
WebSocket delivery is best-effort. What if Bob's connection drops right when the message is being pushed?
Solution: Client-side checkpoints.
The client tracks the timestamp of the last message it received (its checkpoint). On reconnect, it asks the API server over HTTP for everything newer than that checkpoint, e.g. GET /messages?channel=general&since=&lt;checkpoint&gt;.
The DB is the source of truth. The WebSocket is a real-time optimization. If it fails, the client catches up via HTTP.
This is why persist-first matters — the data is always there waiting.
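A sketch of client-side checkpointing, where `fetch_since` stands in for the HTTP history endpoint (all names are illustrative):

```python
class ChatClient:
    def __init__(self, fetch_since):
        self.fetch_since = fetch_since  # HTTP call: messages newer than ts
        self.checkpoint = 0             # timestamp of last message seen
        self.messages = []

    def on_ws_message(self, msg):
        # Normal realtime path: advance the checkpoint as messages arrive.
        self.messages.append(msg)
        self.checkpoint = max(self.checkpoint, msg["ts"])

    def on_reconnect(self):
        # Socket dropped: fetch everything missed since the checkpoint.
        for msg in self.fetch_since(self.checkpoint):
            self.on_ws_message(msg)
```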
Scaling WebSockets: Pockets
A single WS server can handle ~60K persistent connections. At Slack's scale (millions of users), you need many WS servers.
The Pocket Architecture
Instead of broadcasting to ALL WS servers globally, organize them into pockets — groups of WS servers that share a pub/sub bus:
graph TD
CB["Connection Balancer"] --> P1["Pocket 1<br/>(Workspace: Acme Corp)<br/>wss1, wss2, ... wss100"]
CB --> P2["Pocket 2<br/>(Workspace: BigCo)<br/>wss101, ... wss200"]
P1 --- PS1["Redis Pub/Sub<br/>(Pocket 1)"]
P2 --- PS2["Redis Pub/Sub<br/>(Pocket 2)"]
style CB fill:#4a3a1a,stroke:#ffaa44
style PS1 fill:#1a4a1a,stroke:#44ff44
style PS2 fill:#1a4a1a,stroke:#44ff44
Key insight: A Slack workspace rarely exceeds 60K members. So one pocket (a handful of WS servers) can serve an entire workspace. The pub/sub broadcast only goes to servers within the pocket, not globally.
Connection Balancer routes a user to the correct pocket based on their workspace.
Within a pocket, the message service knows the topology — which WS servers hold connections for which channels — so it can target even more precisely.
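The two-level routing can be sketched as: pocket by workspace, then a hash-stable server within the pocket. The pocket contents and the hash choice here are illustrative:

```python
import hashlib

# Made-up pocket layout: one pocket of WS servers per workspace.
POCKETS = {
    "acme-corp": ["wss1", "wss2", "wss3"],
    "bigco": ["wss101", "wss102"],
}

def route(workspace: str, user_id: int) -> str:
    """Connection-balancer sketch: workspace picks the pocket, a stable
    hash of the user picks a server inside it."""
    servers = POCKETS[workspace]
    h = int(hashlib.sha256(str(user_id).encode()).hexdigest(), 16)
    return servers[h % len(servers)]
```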
The 64K Port "Myth"
A TCP connection is identified by a 4-tuple: (src_ip, src_port, dst_ip, dst_port)
With a Load Balancer in front:
LB_IP:port1 → Server:443
LB_IP:port2 → Server:443
...
LB_IP:port65535 → Server:443 ← capped at ~64K (one source IP)
Direct client connections (no LB):
Client1_IP:port → Server:443
Client2_IP:port → Server:443
Client3_IP:port → Server:443 ← millions possible (unique 4-tuples)
WhatsApp handles ~2-3 million connections per server using direct connections. The real bottleneck becomes memory and file descriptors, not ports.
The tradeoff: Removing the LB means WS servers are "naked on the internet" — they must handle DDoS protection, TLS termination, and authentication themselves. This is a hard security problem that WhatsApp has solved.
Real-World Architecture Details
Slack's Internal Services
| Service | Role |
|---|---|
| Channel Servers | Hold channel state, mapped via consistent hashing on channel_id |
| Gateway Servers | The WS servers — geo-distributed across AWS regions |
| Flannel | Edge cache that solved the boot problem (see below) |
| Presence Servers | Track who is online (the green dot) |
Flannel — when a client connects, it needs workspace state (channels, members, unread counts). Originally this was a giant payload. Flannel caches this at regional edge nodes — 44x payload reduction for large workspaces (~32K users).
Discord's Cassandra → ScyllaDB Migration
Discord stored messages in Cassandra with partition_key = (channel_id, time_bucket). It worked until:
- Hot partitions — viral announcements in huge servers = millions reading same partition
- Java GC pauses — caused latency spikes requiring manual node reboots
They migrated to ScyllaDB (Cassandra-compatible, C++, no GC) and built a request coalescing layer: if 1000 users request the same channel's messages, the DB is queried only once.
Result: 177 Cassandra nodes → 72 ScyllaDB nodes, p99 read latency 40-125ms → 15ms.
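The coalescing idea can be sketched with asyncio futures: concurrent identical reads share one in-flight query. This is an illustration of the pattern, not Discord's code:

```python
import asyncio

class Coalescer:
    """Concurrent reads for the same key await a single shared future."""
    def __init__(self, query_db):
        self.query_db = query_db          # the real (slow) read
        self.inflight: dict[str, asyncio.Future] = {}

    async def get(self, channel_id: str):
        if channel_id not in self.inflight:
            # First caller for this key starts the actual query...
            self.inflight[channel_id] = asyncio.ensure_future(
                self._fetch(channel_id))
        # ...everyone (including the first caller) awaits the same future.
        return await self.inflight[channel_id]

    async def _fetch(self, channel_id):
        try:
            return await self.query_db(channel_id)
        finally:
            del self.inflight[channel_id]
```

With 1000 readers of the same hot channel, `query_db` runs once instead of 1000 times.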
WhatsApp's Erlang Architecture
WhatsApp achieves millions of connections per server using Erlang:
- Each connection = one Erlang process (~300 bytes of memory)
- Not OS threads, not goroutines — Erlang's native lightweight processes
- The "naive" approach (one process per connection) IS the efficient approach in Erlang
- Erlang was built for telecom switches — fault isolation is built in
Lesson: Sometimes choosing the right runtime eliminates entire categories of complexity.
Recommended Reading
- Real-time Messaging — Slack Engineering
- Flannel: Application-Level Edge Cache — Slack Engineering
- Migrating Millions of WebSockets to Envoy — Slack Engineering
- How Discord Stores Trillions of Messages
Topics for Later
- WebSocket implementation in Python (asyncio + websockets library)
- Long polling and Server-Sent Events (SSE) as alternatives to WebSockets
- Edge server security — DDoS protection, TLS termination at the WS layer
- Presence system design (the "online" green dot)
- Message ordering and consistency in distributed chat