System Design: Real-time Chat at Million-User Scale
Posted on: 4/21/2026 8:15:59 PM
Table of contents
- Why is Chat System Design a Hard Problem?
- System Requirements Analysis
- High-Level Architecture
- WebSocket Gateway — The Heart of the System
- Message Routing — Delivering Messages to the Right Person
- Database Design — Storing Billions of Messages
- Presence System — "Online", "Last seen", "Typing..."
- Push Notifications for Offline Users
- End-to-End Encryption (E2EE)
- Scaling Strategies
- Recommended Tech Stack for Production
- Monitoring & Observability
- Real-world Lessons from WhatsApp and Telegram
- Conclusion
Why is Chat System Design a Hard Problem?
Every day, WhatsApp processes over 100 billion messages, and Telegram serves 900 million active users. Behind the simple interface — type, send, receive — lies an extremely complex distributed architecture that must simultaneously handle: real-time message delivery under 200ms latency, guarantee zero message loss, scale to millions of concurrent connections, and support group chats with up to 200,000 members.
This article dives deep into the architecture of a large-scale real-time Chat system — from WebSocket connection management, message routing, database design to presence systems and end-to-end encryption — from a practical production System Design perspective.
System Requirements Analysis
Functional Requirements
- 1:1 Chat: Send and receive real-time messages between two users
- Group Chat: Support groups with up to thousands of members
- Online/Offline Status: Display user activity status
- Read Receipts: Confirm message status (sent → delivered → read)
- Push Notifications: Notify when user is offline
- Media Sharing: Send photos, videos, and file attachments
- Message History: Store and sync message history across multiple devices
Non-functional Requirements
- Low Latency: End-to-end delivery under 200ms at the 99th percentile
- High Availability: 99.99% uptime (maximum 52 minutes downtime/year)
- Message Ordering: Messages must display in correct order within a conversation
- At-least-once Delivery: No message loss, accept duplicates (client-side dedup)
- Scalability: Horizontal scaling to 50M+ concurrent connections
Quick Estimation
Assuming 50 million DAU, each user sends an average of 40 messages/day → 2 billion messages/day → ~23,000 messages/second. Peak load at 3x → ~70,000 messages/second. Each message averages 200 bytes → storage ~400GB/day for raw messages, excluding metadata and indexes.
High-Level Architecture
graph TB
subgraph Clients["📱 Clients"]
C1[Mobile App]
C2[Web App]
C3[Desktop App]
end
subgraph Edge["Edge Layer"]
LB[Load Balancer
Sticky Sessions]
end
subgraph WS["WebSocket Gateway Cluster"]
WS1[WS Server 1
~500K connections]
WS2[WS Server 2
~500K connections]
WS3[WS Server N
~500K connections]
end
subgraph Services["Application Services"]
MS[Message Service]
PS[Presence Service]
GS[Group Service]
NS[Notification Service]
SYNC[Sync Service]
end
subgraph MQ["Message Queue"]
K[Kafka / RabbitMQ]
end
subgraph Storage["Data Layer"]
DB[(Message DB
Cassandra/ScyllaDB)]
CACHE[(Session Cache
Redis Cluster)]
S3[Media Storage
S3/R2]
IDX[(Search Index
Elasticsearch)]
end
C1 & C2 & C3 --> LB
LB --> WS1 & WS2 & WS3
WS1 & WS2 & WS3 --> MS
MS --> K
K --> DB
K --> NS
WS1 & WS2 & WS3 --> PS
PS --> CACHE
MS --> GS
NS --> C1 & C2 & C3
SYNC --> DB
style LB fill:#e94560,stroke:#fff,color:#fff
style K fill:#2c3e50,stroke:#fff,color:#fff
style DB fill:#16213e,stroke:#fff,color:#fff
style CACHE fill:#e94560,stroke:#fff,color:#fff
High-level architecture of a large-scale real-time Chat system
The architecture is divided into 4 main layers: Edge Layer handles connections and load balancing; WebSocket Gateway maintains persistent connections with clients; Application Services handle business logic; and Data Layer stores messages, sessions, and media.
WebSocket Gateway — The Heart of the System
WebSocket is the foundational protocol for real-time chat because it maintains a continuous bidirectional connection between client and server, unlike traditional HTTP request-response. When a user opens the app, a WebSocket connection is established and kept open throughout the session.
Connection Management
Each WebSocket server can manage 500K–1M concurrent connections using event-driven I/O (epoll on Linux). With 50 million concurrent connections, approximately 50–100 WebSocket servers are needed.
sequenceDiagram
participant C as Client
participant LB as Load Balancer
participant WS as WS Gateway
participant R as Redis (Session)
participant MS as Message Service
C->>LB: WebSocket Handshake
LB->>WS: Route (Sticky by User ID)
WS->>R: Register session
{userId: ws-server-3, connId: abc123}
R-->>WS: OK
WS-->>C: Connection Established
Note over C,WS: Heartbeat every 30s to keep connection alive
C->>WS: Send Message {to: user_B, text: "Hello!"}
WS->>MS: Route message
MS->>R: Lookup user_B session
R-->>MS: {server: ws-server-7, connId: xyz789}
MS->>WS: Deliver to ws-server-7
WS->>C: Message delivered ✓
1:1 message sending flow through WebSocket Gateway
Sticky Sessions vs Stateless Routing
The Load Balancer uses consistent hashing by User ID to route WebSocket connections to the same server. When a server crashes, only users on that server are affected and automatically reconnect to a new server. Redis stores the userId → wsServerId mapping so other services know where to deliver messages.
Heartbeat & Reconnection
Clients send heartbeat (ping/pong) every 30 seconds. If the server doesn't receive a heartbeat within 90 seconds, the connection is marked dead and the session is removed from Redis. Client-side retry logic uses exponential backoff: 1s → 2s → 4s → 8s → max 30s.
Message Routing — Delivering Messages to the Right Person
1:1 Chat Flow
When user A sends a message to user B:
- WS Gateway receives the message from A, assigns a
messageId(Snowflake ID — guaranteed unique + time-sortable) and timestamp - Message Service validates and enqueues to Kafka topic
messages - Consumer persists to database (write-ahead), sends ACK back to sender
- Router queries Redis to find user B's WebSocket server
- If B is online: push via WebSocket → B sends delivery receipt → status updated to
delivered - If B is offline: message stays in inbox, triggers push notification via FCM/APNs
Message Ordering Challenge
In distributed systems, messages can arrive out of order. Solution: use Snowflake IDs or Lamport timestamps — clients sort messages by ID instead of server-received time. Kafka partitions by conversationId to ensure ordering within the same conversation.
Group Chat — The Fan-out Problem
Group chat is the biggest scaling challenge. When 1 user sends a message to a group of 10,000 members, the system must fan-out the message to everyone — this is called write amplification.
| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fan-out on Write | Copy message to each member's inbox on send | Fast reads, simple client logic | High write amplification for large groups |
| Fan-out on Read | Store once, members query when opening chat | Efficient writes, saves storage | Slow reads for large groups, more complex |
| Hybrid | Write fan-out for small groups (<500), Read for large groups | Best balance | More complex logic |
graph LR
subgraph Write["Fan-out on Write"]
A1[User sends msg] --> B1[Message Service]
B1 --> C1[Copy → User 1 Inbox]
B1 --> D1[Copy → User 2 Inbox]
B1 --> E1[Copy → User N Inbox]
end
subgraph Read["Fan-out on Read"]
A2[User sends msg] --> B2[Message Service]
B2 --> C2[Store 1 copy
in Group Store]
D2[User 1 opens chat] --> C2
E2[User 2 opens chat] --> C2
end
style B1 fill:#e94560,stroke:#fff,color:#fff
style B2 fill:#e94560,stroke:#fff,color:#fff
style C2 fill:#2c3e50,stroke:#fff,color:#fff
Comparing two fan-out strategies for group chat
WhatsApp uses fan-out on write since groups max out at 1,024 members — write amplification is acceptable. Telegram uses a hybrid approach: fan-out on write for small groups, fan-out on read for supergroups with up to 200,000 members.
Database Design — Storing Billions of Messages
Choosing the Right Database
Chat messages have distinct characteristics: write-heavy (continuous writes), recent-read-heavy (mostly reading recent messages), append-only (no edits, rare deletes), time-series-like (queried by time). These characteristics favor wide-column stores over traditional RDBMS.
| Database | Write Throughput | Read Pattern | Use Case |
|---|---|---|---|
| Cassandra / ScyllaDB | Very high (100K+ writes/s) | Partition key scan | Primary message storage |
| PostgreSQL | Medium | Flexible (index, join) | User profiles, settings |
| Redis | Extremely high (in-memory) | Key-value, sorted set | Session, presence, recent messages cache |
| S3 / R2 | High | Object key | Media files (images, videos, files) |
Cassandra Schema Design
-- Message table: partition by conversation, cluster by time
CREATE TABLE messages (
conversation_id UUID,
message_id BIGINT, -- Snowflake ID (time-sortable)
sender_id UUID,
message_type TEXT, -- 'text', 'image', 'video', 'file'
content TEXT,
media_url TEXT,
status TEXT, -- 'sent', 'delivered', 'read'
created_at TIMESTAMP,
PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);
-- User conversation inbox
CREATE TABLE user_conversations (
user_id UUID,
last_message_at TIMESTAMP,
conversation_id UUID,
conversation_type TEXT, -- '1:1' or 'group'
last_message_preview TEXT,
unread_count INT,
PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);
Why Snowflake ID Instead of UUID?
Snowflake ID (64-bit) consists of: 41 bits timestamp + 10 bits machine ID + 12 bits sequence number. Advantages: time-sortable (no additional timestamp column needed for sorting), smaller than UUID (8 bytes vs 16 bytes), and extremely fast generation without database roundtrips. Discord, Twitter, and Instagram all use variants of Snowflake ID.
Hot/Cold Storage Tiering
Not all messages are accessed frequently. Tiering strategy:
- Hot (0-7 days): Redis sorted set — ultra-fast queries, store last 50 messages per conversation
- Warm (7-90 days): Cassandra/ScyllaDB — primary storage, SSD-backed
- Cold (90+ days): Object storage (S3) as compressed Parquet — minimal cost, queryable via Spark/Presto when needed
Presence System — "Online", "Last seen", "Typing..."
The presence system tracks real-time user status: online, offline, last seen, typing. This seems simple but is very hard to scale because each status change must be broadcast to all of that user's contacts.
stateDiagram-v2
[*] --> Online: WebSocket connected
Online --> Away: Inactive > 5 minutes
Away --> Online: User interaction
Online --> Offline: Disconnect / Heartbeat timeout
Away --> Offline: Heartbeat timeout
Offline --> Online: Reconnect
Online --> Typing: Start typing
Typing --> Online: Stop typing > 3s
Typing --> Online: Send message
Presence System state machine
Scaling Presence
Naive approach: broadcast to all friends whenever a user changes status. If a user has 500 online friends → 500 WebSocket pushes. 1 million users changing status/minute → 500 million pushes/minute. Not feasible.
Smart approach — Lazy Presence:
- Store status in Redis with TTL (key:
presence:{userId}, value:online, TTL: 60s) - Heartbeat every 30s refreshes TTL — when TTL expires, automatically transitions to offline
- Only query presence when a user opens a conversation, don't broadcast to entire contact list
- Subscribe to presence changes of visible contacts via Redis Pub/Sub channels
WhatsApp Presence: "Last seen"
WhatsApp doesn't continuously broadcast online status. Instead, when user A opens a chat with B, the app queries B's "last seen". This is a pull-based design that saves significant bandwidth compared to push-based approaches. The "Typing..." indicator is only sent to users currently viewing that conversation.
Push Notifications for Offline Users
When the receiver is offline, the message is persisted to the database and triggers a push notification via FCM (Android) or APNs (iOS). Key design considerations:
- Batching: Combine multiple messages into a single notification if they arrive within a short window ("You have 5 new messages from Anh Tu")
- Priority: 1:1 messages → high priority (display immediately), group chat → normal priority (may be delayed by OS)
- Unread sync: When the user reopens the app, Sync Service pulls all unread messages from the database, not relying on push notifications
- Deduplication: Client uses
messageIdto dedup — if already received via WebSocket, ignore the notification
End-to-End Encryption (E2EE)
E2EE ensures only the sender and receiver can read message content — the server only sees ciphertext. WhatsApp and Signal use the Signal Protocol (Double Ratchet Algorithm + X3DH key agreement).
sequenceDiagram
participant A as Alice
participant S as Server
participant B as Bob
Note over A,B: Key Exchange (X3DH)
A->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys
B->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys
A->>S: Request Bob's Pre-Key Bundle
S-->>A: Bob's keys (Identity + Signed Pre + One-time)
Note over A: Compute Shared Secret via
X3DH key agreement
A->>S: Encrypted message (ciphertext)
S->>B: Forward ciphertext
Note over B: Compute Shared Secret
→ Decrypt message
Note over A,B: Double Ratchet: each message
uses a different key
End-to-End Encryption flow with Signal Protocol
Why Double Ratchet?
Forward secrecy: If the current key is compromised, attackers cannot decrypt past messages because each message uses its own key derived from the ratchet chain. Post-compromise security: After a key is compromised, the system self-heals by generating new keys through the DH ratchet on the next message.
Scaling Strategies
WebSocket Gateway Scaling
The WebSocket gateway is stateful (holds connections), so it can't be scaled as simply as stateless HTTP services. Strategies:
- Horizontal scaling: Add servers, use consistent hashing to distribute users. When adding/removing servers, only ~1/N users need to reconnect
- Connection draining: When shutting down a server, notify clients to reconnect gradually (graceful) instead of abrupt disconnection
- Regional deployment: Deploy WebSocket gateways across multiple regions, route users to the nearest server using GeoDNS
Message Database Sharding
graph LR
subgraph Shard["Sharding by conversation_id"]
M[Message] --> H{Hash
conversation_id}
H -->|hash % 4 = 0| S0[Shard 0]
H -->|hash % 4 = 1| S1[Shard 1]
H -->|hash % 4 = 2| S2[Shard 2]
H -->|hash % 4 = 3| S3[Shard 3]
end
style H fill:#e94560,stroke:#fff,color:#fff
style S0 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style S1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style S2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style S3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
Message database sharding by conversation_id
The shard key is conversation_id rather than user_id because: all messages in the same conversation reside on 1 shard → querying "get the latest 50 messages of conversation X" only hits 1 shard, no scatter-gather needed.
Recommended Tech Stack for Production
| Component | Technology | Rationale |
|---|---|---|
| WebSocket Gateway | .NET 10 (Kestrel) / Go | Kestrel handles hundreds of thousands of concurrent connections with low memory footprint |
| Message Queue | Apache Kafka | High throughput, partition-based ordering, replay capability |
| Message Store | ScyllaDB (Cassandra-compatible) | Extremely high write throughput, linear scalability, familiar CQL |
| Session/Presence Cache | Redis Cluster | Sub-millisecond latency, Pub/Sub for presence, automatic TTL cleanup |
| User/Group Metadata | PostgreSQL | ACID compliance for user profiles, group settings, permissions |
| Media Storage | S3 / Cloudflare R2 | S3-compatible, R2 has zero egress fees — ideal for chat images/videos |
| Search | Elasticsearch / Meilisearch | Full-text search across message history |
| Push Notifications | FCM + APNs | Platform-native, reliable, free |
Monitoring & Observability
For real-time systems, monitoring is mission-critical. Key metrics to track:
- Message delivery latency (P50, P95, P99): Time from sender sending to receiver receiving — target P99 < 200ms
- WebSocket connection count per server: Alert when exceeding 80% capacity
- Message queue lag: Consumer lag on Kafka — continuously increasing lag signals a bottleneck
- Failed delivery rate: Percentage of messages that fail to deliver after 3 retries
- Database write latency: P99 write latency of Cassandra — target < 10ms
Distributed Tracing for Message Flow
Attach a traceId to each message from the moment the sender sends it, propagating through WebSocket Gateway → Kafka → Consumer → Database → Receiver. Use OpenTelemetry + Jaeger/Tempo to trace the entire message journey — when a user reports "slow message delivery", you can pinpoint exactly where the bottleneck is.
Real-world Lessons from WhatsApp and Telegram
Conclusion
Designing a large-scale real-time Chat system requires careful consideration at multiple layers: from WebSocket connection management with consistent hashing and heartbeat, message routing with appropriate fan-out strategies, database design optimized for write-heavy workloads, to bandwidth-efficient presence systems and privacy-preserving E2E encryption.
There's no "one-size-fits-all" solution — WhatsApp chose Erlang and fan-out on write because of small groups, Telegram chose MTProto and hybrid fan-out for large supergroups, Discord rewrote in Rust due to memory pressure. The common thread: understanding the tradeoffs in every architectural decision, designing for failure (network partitions, server crashes), and always measuring with real metrics rather than guessing.
References:
OAuth 2.1 & OpenID Connect: Modern API Authentication in 2026
Server-Sent Events — Building a Real-time Dashboard with .NET 10, Vue 3 & Redis
Disclaimer: The opinions expressed in this blog are solely my own and do not reflect the views or opinions of my employer or any affiliated organizations. The content provided is for informational and educational purposes only and should not be taken as professional advice. While I strive to provide accurate and up-to-date information, I make no warranties or guarantees about the completeness, reliability, or accuracy of the content. Readers are encouraged to verify the information and seek independent advice as needed. I disclaim any liability for decisions or actions taken based on the content of this blog.