Real-time systems architecture
In this series (12 parts)
- Monolith vs microservices
- Microservice communication patterns
- Service discovery and registration
- Event-driven architecture
- Distributed data patterns
- Caching architecture patterns
- Search architecture
- Storage systems at scale
- Notification systems
- Real-time systems architecture
- Batch and stream processing
- Multi-region and global systems
A user opens a chat window and sends “hey.” The recipient sees the message appear within 200 milliseconds. No page refresh, no polling delay, no “pull to refresh.” That immediacy is what users expect from any modern application with social or collaborative features.
Delivering this experience is architecturally different from request-response HTTP. HTTP is stateless and short-lived: the client sends a request, the server responds, the connection closes. Real-time features need the opposite. The connection must stay open, the server must be able to push data without being asked, and the system must track which users are connected to which servers.
Defining real-time requirements
“Real-time” means different things depending on context. A stock trading system requires sub-millisecond latency. A chat application needs messages delivered within a few hundred milliseconds. A live sports scoreboard tolerates one to two seconds of delay. A collaborative document editor needs updates within 500ms to feel responsive.
Be precise about your requirements. “Real-time” in a system design discussion usually means soft real-time: latency matters to user experience, but a 500ms delay does not cause data loss or safety failures. Hard real-time (industrial control systems, pacemakers) is a different domain with different tools.
The two metrics that matter most are latency (how quickly a message reaches the recipient after being sent) and throughput (how many messages per second the system can deliver). A chat app for a small team might handle 100 messages per second. A live event platform with millions of concurrent viewers might need to deliver 10 million messages per second.
WebSocket servers at scale
WebSockets provide full-duplex communication over a single TCP connection. After an HTTP handshake upgrades the connection, both client and server can send frames at any time without the overhead of HTTP headers. This makes WebSockets ideal for real-time features where the server needs to push data proactively.
A single WebSocket server can hold tens of thousands of concurrent connections. The limiting factors are memory (each connection consumes a socket descriptor and some buffer space), CPU (serializing/deserializing messages), and network bandwidth. A well-tuned server on modern hardware handles 50,000 to 100,000 concurrent WebSocket connections.
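Those per-server limits translate directly into fleet size. A back-of-envelope calculation makes the point (the connection target and headroom figure below are illustrative assumptions, not benchmarks):

```python
# Illustrative sizing: the numbers below are assumptions, not measurements.
target_connections = 2_000_000   # assumed peak concurrent connections
conns_per_server = 50_000        # conservative end of the 50k-100k range
headroom = 0.30                  # spare capacity for failover and traffic spikes

# Ceiling division: round up so a partial server still counts as a whole one
servers_needed = -(-int(target_connections * (1 + headroom)) // conns_per_server)
print(servers_needed)            # 52 servers
```

The headroom matters: when one server fails, its tens of thousands of connections reconnect to the survivors almost simultaneously, so the fleet needs room to absorb that wave.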
But suppose your system needs millions of concurrent connections. That means running a fleet of WebSocket servers behind a load balancer. Here is the first complication: WebSocket connections are stateful. When user A connects to server 3 and user B connects to server 7, a message from A to B must be routed from server 3 to server 7. The load balancer does not know this; it simply distributed the initial connections across the fleet.
graph TD
UA["User A"] -->|WS Connect| LB["Load Balancer"]
UB["User B"] -->|WS Connect| LB
UC["User C"] -->|WS Connect| LB
LB --> WS1["WS Server 1"]
LB --> WS2["WS Server 2"]
LB --> WS3["WS Server 3"]
WS1 <-->|Pub/Sub| PS["Pub/Sub Layer (Redis / Kafka)"]
WS2 <-->|Pub/Sub| PS
WS3 <-->|Pub/Sub| PS
PS --> WS2
WS2 -->|Deliver to User B| UB
WebSocket connection management with a pub/sub layer routing messages between servers.
The solution is a pub/sub layer between WebSocket servers. When user A sends a message to a chat room, server 3 publishes it to a channel representing that room. All WebSocket servers subscribe to channels for the rooms their connected users belong to. Server 7, which holds user B’s connection, receives the published message and pushes it to user B.
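The routing pattern can be sketched with an in-process broker standing in for Redis Pub/Sub (class and channel names here are illustrative, not from any real library):

```python
from collections import defaultdict

class Broker:
    """In-process stand-in for the pub/sub layer (e.g. Redis Pub/Sub)."""
    def __init__(self):
        self._subs = defaultdict(list)          # channel -> subscriber callbacks

    def subscribe(self, channel, callback):
        self._subs[channel].append(callback)

    def publish(self, channel, message):        # fire-and-forget fan-out
        for cb in self._subs[channel]:
            cb(channel, message)

class WSServer:
    """One WebSocket server: holds local connections, bridges to the broker."""
    def __init__(self, broker):
        self.broker = broker
        self.rooms = defaultdict(set)           # room channel -> local user ids
        self.outbox = defaultdict(list)         # user id -> frames pushed to them

    def connect(self, user, room):
        if room not in self.rooms:              # first local member: subscribe once
            self.broker.subscribe(room, self._on_message)
        self.rooms[room].add(user)

    def send(self, room, message):              # a local user sends to the room
        self.broker.publish(room, message)

    def _on_message(self, room, message):
        for user in self.rooms[room]:           # includes the sender's own echo
            self.outbox[user].append(message)   # push over that user's socket

# User A on one server, user B on another, both in the same room
broker = Broker()
server_3, server_7 = WSServer(broker), WSServer(broker)
server_3.connect("A", "room:42")
server_7.connect("B", "room:42")
server_3.send("room:42", "hey")
print(server_7.outbox["B"])   # ['hey']
```

Note that each server subscribes to a room channel once, regardless of how many local users are in that room; fan-out to individual connections happens locally on each server.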
Redis Pub/Sub is a common choice for this layer due to its low latency. For higher throughput or durability needs, Kafka topics work well. The trade-off is latency (Redis is sub-millisecond, Kafka adds a few milliseconds) versus reliability (Kafka persists messages, Redis Pub/Sub is fire-and-forget).
Connection state management
Each WebSocket connection has a lifecycle: connect, authenticate, subscribe to channels, exchange messages, and eventually disconnect (gracefully or due to network failure). Managing this lifecycle at scale requires a connection registry.
The connection registry tracks which user is connected to which server. When user A sends a message to user B, the system looks up user B’s server assignment in the registry and routes the message there. Redis is commonly used as the registry because it supports fast key-value lookups with TTL-based expiration for detecting stale connections.
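A minimal sketch of the registry, with TTL expiry modeled in memory (in production this would be Redis keys with an expiry; the class and method names here are assumptions for illustration):

```python
import time

class ConnectionRegistry:
    """User -> server mapping with TTL expiry, mimicking Redis SET with EX.
    Heartbeats re-register to keep the entry alive; stale entries vanish."""
    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries = {}                       # user_id -> (server_id, expires_at)

    def register(self, user_id, server_id):      # called on connect and heartbeat
        self._entries[user_id] = (server_id, self.clock() + self.ttl)

    def unregister(self, user_id):               # graceful disconnect
        self._entries.pop(user_id, None)

    def lookup(self, user_id):
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        server_id, expires_at = entry
        if self.clock() >= expires_at:           # heartbeats stopped: treat as gone
            del self._entries[user_id]
            return None
        return server_id

# Deterministic demo using an injected fake clock instead of wall time
now = [0.0]
reg = ConnectionRegistry(ttl_seconds=60, clock=lambda: now[0])
reg.register("alice", "ws-3")
print(reg.lookup("alice"))   # ws-3
now[0] = 61.0
print(reg.lookup("alice"))   # None: entry expired
```

The TTL is the safety net: even if a server crashes without unregistering its users, their stale entries age out on their own.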
Connection drops happen constantly. Mobile users move between cell towers. Laptops go to sleep. Network partitions occur. Your system needs to distinguish between a user who intentionally disconnected and one who lost connectivity temporarily.
Heartbeat mechanisms solve this. The server sends a ping frame every 30 seconds (configurable). If the client does not respond with a pong within a timeout window, the server considers the connection dead and cleans up. On the client side, if no ping arrives, the client assumes the server is unreachable and attempts to reconnect.
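The server-side bookkeeping reduces to a small state machine per connection. A sketch (the class name and defaults are illustrative; the 30-second interval matches the article's example):

```python
class Heartbeat:
    """Per-connection liveness tracking: ping every `interval` seconds,
    declare the connection dead if no pong arrives within `timeout`."""
    def __init__(self, interval=30.0, timeout=10.0):
        self.interval, self.timeout = interval, timeout
        self.last_ping = 0.0                     # when we last sent a ping
        self.last_pong = 0.0                     # when we last heard back

    def should_ping(self, now):
        return now - self.last_ping >= self.interval

    def on_ping_sent(self, now):
        self.last_ping = now

    def on_pong(self, now):
        self.last_pong = now

    def is_dead(self, now):
        # Dead only if a ping is outstanding and the timeout has elapsed
        return self.last_ping > self.last_pong and now - self.last_ping > self.timeout

hb = Heartbeat()
hb.on_ping_sent(30.0)                            # ping at t=30
hb.on_pong(30.2)                                 # pong arrives: healthy
print(hb.is_dead(45.0))                          # False
hb.on_ping_sent(60.0)                            # next ping; no pong this time
print(hb.is_dead(71.0))                          # True: clean up the connection
```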
Reconnection should be seamless. When a client reconnects, it sends the timestamp or sequence number of the last message it received. The server replays any missed messages from a short-lived buffer (typically stored in Redis with a TTL of a few minutes). This avoids the user seeing gaps in their chat history after a brief network interruption.
sequenceDiagram
participant Client
participant WS Server
participant Registry
participant Buffer
Client->>WS Server: Connect + Auth Token
WS Server->>Registry: Register(userId, serverId)
WS Server->>Client: Connected
loop Every 30s
WS Server->>Client: Ping
Client->>WS Server: Pong
end
Note over Client,WS Server: Network drops
WS Server->>Registry: Unregister(userId) after timeout
Client->>WS Server: Reconnect + lastSeqNum
WS Server->>Registry: Register(userId, serverId)
WS Server->>Buffer: Fetch messages since lastSeqNum
Buffer->>WS Server: Missed messages
WS Server->>Client: Replay missed messages
Connection lifecycle showing heartbeat, disconnect detection, and reconnection with message replay.
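The replay step in the diagram relies on a short-lived, bounded buffer keyed by sequence number. A minimal in-memory sketch (standing in for the Redis buffer the article describes; names are illustrative):

```python
from collections import deque

class ReplayBuffer:
    """Short-lived per-channel buffer: keeps the last `capacity` messages
    with sequence numbers so a reconnecting client can catch up."""
    def __init__(self, capacity=1000):
        self._buf = deque(maxlen=capacity)       # oldest entries fall off the front
        self._seq = 0

    def append(self, message):
        self._seq += 1
        self._buf.append((self._seq, message))
        return self._seq

    def since(self, last_seq):
        # Everything the client missed after the last message it acknowledged
        return [(seq, msg) for seq, msg in self._buf if seq > last_seq]

buf = ReplayBuffer(capacity=1000)
for msg in ["hey", "how are you?", "still there?"]:
    buf.append(msg)
# Client reconnects having last seen seq 1: replay the rest
print(buf.since(1))   # [(2, 'how are you?'), (3, 'still there?')]
```

If the client has been gone longer than the buffer's retention, `since` returns an incomplete window; at that point the client should fall back to fetching history over the REST path rather than trusting the replay.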
Presence and online status
The green dot next to a user’s name (“online”), the typing indicator (“Alice is typing…”), and “last seen 5 minutes ago” are all presence features. They seem simple but scale poorly because presence is inherently fan-out heavy: if Alice has 500 friends, all 500 need to know when she comes online.
A naive implementation publishes a presence update to every friend on every state change. Alice connects: 500 updates. Alice disconnects: 500 more. Alice’s phone briefly loses signal and reconnects: 1,000 wasted updates. Multiply by millions of users and you have a presence storm that saturates your pub/sub layer.
Several techniques mitigate this. First, debounce state changes: do not publish “offline” until a user has been disconnected for at least 30 seconds. This absorbs brief network blips. Second, lazy presence: instead of pushing presence to all friends proactively, only compute it when someone views a friend list or opens a chat. The viewing client asks “is Alice online?” and the system checks the connection registry. Third, group presence by channel: in a chat room with 20 members, presence updates go to the room channel, not to each individual member’s connection.
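The debounce technique from the list above can be sketched as a small state machine driven by connect/disconnect events plus a periodic sweep (class and method names are assumptions for illustration):

```python
class PresenceDebouncer:
    """Delay 'offline' broadcasts so brief disconnects produce no updates."""
    def __init__(self, offline_grace=30.0):
        self.grace = offline_grace
        self.online = set()
        self.pending_offline = {}                 # user -> disconnect timestamp

    def on_connect(self, user, now):
        if self.pending_offline.pop(user, None) is not None:
            return []                             # reconnected within grace: no-op
        if user in self.online:
            return []
        self.online.add(user)
        return [(user, "online")]                 # presence events to publish

    def on_disconnect(self, user, now):
        self.pending_offline[user] = now          # don't broadcast yet
        return []

    def tick(self, now):
        # Periodic sweep: only disconnects older than the grace period count
        events = []
        for user, t in list(self.pending_offline.items()):
            if now - t >= self.grace:
                del self.pending_offline[user]
                self.online.discard(user)
                events.append((user, "offline"))
        return events

events = []
p = PresenceDebouncer(offline_grace=30.0)
events += p.on_connect("alice", now=0.0)          # alice comes online
p.on_disconnect("alice", now=10.0)                # signal blip
events += p.on_connect("alice", now=15.0)         # back within grace: silent
events += p.tick(now=50.0)                        # nothing left pending
print(events)                                     # [('alice', 'online')]
```

The blip at t=10 never reaches alice's 500 friends: instead of 1,000 wasted updates, the pub/sub layer sees zero.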
Typing indicators use a similar pattern but with shorter timeouts. When a user starts typing, the client sends a “typing” event. The server publishes it to the conversation channel. If no follow-up “typing” event arrives within 3 seconds, the recipient’s client hides the indicator. This is entirely client-driven and should never be persisted.
Pub/sub for live updates
Beyond chat, pub/sub powers live dashboards, collaborative editing, sports scoreboards, and notification streams. The pattern is always the same: producers publish events to named channels, and consumers subscribed to those channels receive the events.
The choice of pub/sub technology depends on your requirements. For low-latency, ephemeral messaging (presence, typing indicators, live cursors), Redis Pub/Sub or a purpose-built system like NATS works well. Messages are delivered to current subscribers and discarded; if nobody is listening, the message is lost.
For durable, ordered event streams (chat message history, activity feeds, audit logs), Kafka or a similar log-based broker is better. Messages are persisted to a log, assigned sequence numbers, and consumers can rewind to any point. This also enables the reconnection replay pattern described earlier.
Channel granularity matters. A channel per chat conversation is fine for private messaging. A single channel for “all live sports scores” does not scale because every subscriber receives every score update, even for sports they do not follow. Using channels like scores:nba:lakers lets clients subscribe only to what they care about, reducing both server-side fan-out and client-side filtering.
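Segment-wise channel names compose naturally with glob-style pattern subscriptions (similar in spirit to Redis PSUBSCRIBE; this standalone matcher is a sketch of the idea, not a real client API):

```python
def matches(pattern, channel):
    """Segment-wise glob matching over colon-delimited channel names.
    The scores:<league>:<team> scheme follows the article's example."""
    p_parts, c_parts = pattern.split(":"), channel.split(":")
    if len(p_parts) != len(c_parts):
        return False
    return all(p == "*" or p == c for p, c in zip(p_parts, c_parts))

print(matches("scores:nba:lakers", "scores:nba:lakers"))  # True: one team
print(matches("scores:nba:*", "scores:nba:lakers"))       # True: whole league
print(matches("scores:nba:*", "scores:nfl:chiefs"))       # False: other sport
```

A client following one team subscribes to the exact channel; a league-wide scoreboard subscribes to the wildcard. Either way, no subscriber receives updates it would only discard.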
Scaling WebSocket infrastructure
WebSocket servers are stateful, which makes horizontal scaling harder than stateless HTTP services. Here are the key scaling patterns.
Sticky sessions ensure that a client reconnects to the same server after a brief disconnection. This avoids re-subscribing to all channels. The load balancer hashes the user ID or connection ID to route consistently. If the original server is down, the client falls back to any available server and re-subscribes.
Server-side sharding assigns channel responsibility to specific servers. Instead of every server subscribing to every channel, a consistent hash ring determines which server “owns” each channel. Messages for that channel route through the owning server, which pushes to connected clients. This reduces the number of pub/sub subscriptions per server.
Edge WebSocket servers reduce latency for global users. Instead of all connections terminating at a single data center, deploy WebSocket servers at edge locations close to users. Edge servers connect to the central pub/sub layer via a backbone connection. The user gets low-latency connectivity, and the system still routes messages through a central broker.
graph TD
subgraph Edge US
E1["Edge WS Server"]
end
subgraph Edge EU
E2["Edge WS Server"]
end
subgraph Edge APAC
E3["Edge WS Server"]
end
E1 <--> BB["Central Pub/Sub Backbone"]
E2 <--> BB
E3 <--> BB
BB <--> DB["Message Store"]
U1["US User"] --> E1
U2["EU User"] --> E2
U3["APAC User"] --> E3
Edge WebSocket servers connected to a central pub/sub backbone for global low-latency delivery.
Handling backpressure
When the system generates messages faster than clients can consume them, backpressure builds. A slow mobile client on a 3G connection cannot keep up with a busy chat room producing 100 messages per second.
Server-side buffering absorbs short bursts. Each connection gets a bounded send buffer (say, 1000 messages). If the buffer fills, the server has to make a decision: drop the oldest messages, drop the newest, or disconnect the slow client. For chat, dropping oldest messages and catching up with the most recent is usually the right choice.
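The drop-oldest policy falls out naturally from a bounded deque. A per-connection sketch (names and the capacity shown are illustrative):

```python
from collections import deque

class SendBuffer:
    """Bounded per-connection outbox: drop-oldest when a slow client lags."""
    def __init__(self, capacity=1000):
        self.buf = deque(maxlen=capacity)         # full deque evicts from the front
        self.dropped = 0

    def push(self, message):
        if len(self.buf) == self.buf.maxlen:
            self.dropped += 1                     # count losses for monitoring
        self.buf.append(message)

    def drain(self, n):                           # client is ready for n more frames
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]

outbox = SendBuffer(capacity=3)                   # tiny capacity for the demo
for i in range(1, 6):
    outbox.push(f"msg-{i}")
print(list(outbox.buf))    # ['msg-3', 'msg-4', 'msg-5']: oldest two dropped
print(outbox.dropped)      # 2
```

The `dropped` counter is worth exporting as a metric: a client that drops messages constantly, rather than in occasional bursts, is better served by disconnecting it and letting it reload via the paginated HTTP path.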
Client-side pagination complements this. Instead of delivering the full history over the WebSocket, the client fetches historical messages via a REST API (with pagination) and uses the WebSocket only for new messages arriving after the initial load. This separates the historical read path (HTTP, cacheable) from the real-time path (WebSocket, live).
What comes next
Real-time systems handle events as they happen. But many workloads need to process data in bulk or as continuous streams. The next article covers batch and stream processing: Spark and Flink at a conceptual level, lambda versus kappa architectures, and when each approach fits.