System Design: Real-time Chat at Million-User Scale

Posted on: 4/21/2026 8:15:59 PM

Table of contents

Why is Chat System Design a Hard Problem?
System Requirements Analysis
1. Functional Requirements
2. Non-functional Requirements
  1. Quick Estimation
High-Level Architecture
WebSocket Gateway — The Heart of the System
1. Connection Management
  1. Sticky Sessions vs Stateless Routing
2. Heartbeat & Reconnection
Message Routing — Delivering Messages to the Right Person
1. 1:1 Chat Flow
  1. Message Ordering Challenge
2. Group Chat — The Fan-out Problem
Database Design — Storing Billions of Messages
Presence System — "Online", "Last seen", "Typing..."
1. Scaling Presence
  1. WhatsApp Presence: "Last seen"
Push Notifications for Offline Users
End-to-End Encryption (E2EE)
1. Why Double Ratchet?
Scaling Strategies
1. WebSocket Gateway Scaling
2. Message Database Sharding
Recommended Tech Stack for Production
Monitoring & Observability
1. Distributed Tracing for Message Flow
Real-world Lessons from WhatsApp and Telegram
Conclusion

Why is Chat System Design a Hard Problem?

WhatsApp processes over 100 billion messages every day, and Telegram serves more than 900 million monthly active users. Behind the simple interface (type, send, receive) lies a complex distributed architecture that must simultaneously deliver messages in real time with under 200ms latency, guarantee zero message loss, scale to millions of concurrent connections, and support group chats with up to 200,000 members.

This article walks through the architecture of a large-scale real-time chat system from a practical, production-oriented perspective: WebSocket connection management, message routing, database design, presence, and end-to-end encryption.

100B+ Messages/day on WhatsApp

<200ms Required chat latency

500K–1M WebSocket connections/server

99.99% Minimum uptime SLA

System Requirements Analysis

Functional Requirements

1:1 Chat: Send and receive real-time messages between two users
Group Chat: Support groups with up to thousands of members
Online/Offline Status: Display user activity status
Read Receipts: Confirm message status (sent → delivered → read)
Push Notifications: Notify when user is offline
Media Sharing: Send photos, videos, and file attachments
Message History: Store and sync message history across multiple devices

Non-functional Requirements

Low Latency: End-to-end delivery under 200ms at the 99th percentile
High Availability: 99.99% uptime (maximum 52 minutes downtime/year)
Message Ordering: Messages must display in correct order within a conversation
At-least-once Delivery: No message loss, accept duplicates (client-side dedup)
Scalability: Horizontal scaling to 50M+ concurrent connections

Quick Estimation

Assuming 50 million DAU, each user sends an average of 40 messages/day → 2 billion messages/day → ~23,000 messages/second. Peak load at 3x → ~70,000 messages/second. Each message averages 200 bytes → storage ~400GB/day for raw messages, excluding metadata and indexes.
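
The arithmetic above can be reproduced in a few lines. This is only a back-of-envelope sketch: the constant names are illustrative, and the inputs are the assumptions stated in the paragraph (50M DAU, 40 messages per user per day, a 3x peak factor, 200 bytes per message).

```python
# Back-of-envelope capacity estimate using the figures from the text.
DAU = 50_000_000               # daily active users
MSGS_PER_USER_PER_DAY = 40
PEAK_FACTOR = 3                # peak load relative to the daily average
AVG_MSG_BYTES = 200            # raw payload only, no metadata or indexes
SECONDS_PER_DAY = 24 * 60 * 60

messages_per_day = DAU * MSGS_PER_USER_PER_DAY
avg_msgs_per_sec = messages_per_day / SECONDS_PER_DAY
peak_msgs_per_sec = avg_msgs_per_sec * PEAK_FACTOR
raw_storage_gb_per_day = messages_per_day * AVG_MSG_BYTES / 1e9

print(f"{messages_per_day:,} messages/day")
print(f"{avg_msgs_per_sec:,.0f} messages/s on average")
print(f"{peak_msgs_per_sec:,.0f} messages/s at peak")
print(f"{raw_storage_gb_per_day:,.0f} GB/day of raw payload")
```

The exact results are about 23,148 messages/s on average and 69,444 at peak, which the text rounds to 23,000 and 70,000.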

High-Level Architecture

graph TB
    subgraph Clients["📱 Clients"]
        C1[Mobile App]
        C2[Web App]
        C3[Desktop App]
    end

    subgraph Edge["Edge Layer"]
        LB[Load Balancer<br/>Sticky Sessions]
    end

    subgraph WS["WebSocket Gateway Cluster"]
        WS1[WS Server 1<br/>~500K connections]
        WS2[WS Server 2<br/>~500K connections]
        WS3[WS Server N<br/>~500K connections]
    end

    subgraph Services["Application Services"]
        MS[Message Service]
        PS[Presence Service]
        GS[Group Service]
        NS[Notification Service]
        SYNC[Sync Service]
    end

    subgraph MQ["Message Queue"]
        K[Kafka / RabbitMQ]
    end

    subgraph Storage["Data Layer"]
        DB[(Message DB<br/>Cassandra/ScyllaDB)]
        CACHE[(Session Cache<br/>Redis Cluster)]
        S3[Media Storage<br/>S3/R2]
        IDX[(Search Index<br/>Elasticsearch)]
    end

    C1 & C2 & C3 --> LB
    LB --> WS1 & WS2 & WS3
    WS1 & WS2 & WS3 --> MS
    MS --> K
    K --> DB
    K --> NS
    WS1 & WS2 & WS3 --> PS
    PS --> CACHE
    MS --> GS
    NS --> C1 & C2 & C3
    SYNC --> DB

    style LB fill:#e94560,stroke:#fff,color:#fff
    style K fill:#2c3e50,stroke:#fff,color:#fff
    style DB fill:#16213e,stroke:#fff,color:#fff
    style CACHE fill:#e94560,stroke:#fff,color:#fff

High-level architecture of a large-scale real-time Chat system

The architecture is divided into 4 main layers: Edge Layer handles connections and load balancing; WebSocket Gateway maintains persistent connections with clients; Application Services handle business logic; and Data Layer stores messages, sessions, and media.

WebSocket Gateway — The Heart of the System

WebSocket is the foundational protocol for real-time chat because it maintains a continuous bidirectional connection between client and server, unlike traditional HTTP request-response. When a user opens the app, a WebSocket connection is established and kept open throughout the session.

Connection Management

Each WebSocket server can manage 500K–1M concurrent connections using event-driven I/O (epoll on Linux). With 50 million concurrent connections, approximately 50–100 WebSocket servers are needed.

sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant WS as WS Gateway
    participant R as Redis (Session)
    participant MS as Message Service

    C->>LB: WebSocket Handshake
    LB->>WS: Route (Sticky by User ID)
    WS->>R: Register session<br/>{userId: ws-server-3, connId: abc123}
    R-->>WS: OK
    WS-->>C: Connection Established

    Note over C,WS: Heartbeat every 30s to keep connection alive

    C->>WS: Send Message {to: user_B, text: "Hello!"}
    WS->>MS: Route message
    MS->>R: Lookup user_B session
    R-->>MS: {server: ws-server-7, connId: xyz789}
    MS->>WS: Deliver to ws-server-7
    WS->>C: Message delivered ✓

1:1 message sending flow through WebSocket Gateway

Sticky Sessions vs Stateless Routing

The Load Balancer uses consistent hashing by User ID to route WebSocket connections to the same server. When a server crashes, only users on that server are affected and automatically reconnect to a new server. Redis stores the userId → wsServerId mapping so other services know where to deliver messages.
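
To make the routing idea concrete, here is a minimal sketch of a consistent hash ring. The class name `HashRing`, the virtual-node count, and the use of SHA-256 are assumptions made for illustration; a real load balancer would use its own built-in implementation.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hash ring with virtual nodes to smooth the distribution."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server)
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]

    def lookup(self, user_id: str) -> str:
        """Return the first server clockwise from the user's hash."""
        idx = bisect.bisect_left(self._ring, (_hash(user_id), ""))
        return self._ring[idx % len(self._ring)][1]
```

Removing a server from the ring only remaps the users that were assigned to it; every other user keeps the same `lookup` result, which is why a crash affects only that server's connections.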

Heartbeat & Reconnection

Clients send a heartbeat (ping/pong) every 30 seconds. If the server receives no heartbeat for 90 seconds (three missed intervals), it marks the connection dead and removes the session from Redis. Client-side retry logic uses exponential backoff: 1s → 2s → 4s → 8s and so on, capped at 30s.
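
The timeout and backoff rules can be sketched as small pure functions. The function names are hypothetical, and the jitter helper is an addition that the text does not mention; it is a common way to stop many clients from reconnecting at the same instant after a server failure.

```python
import random


def backoff_delays(base=1.0, cap=30.0, attempts=7):
    """Yield reconnect delays: 1s, 2s, 4s, 8s, ... capped at `cap` seconds."""
    for attempt in range(attempts):
        yield min(cap, base * 2 ** attempt)


def with_jitter(delay, rng=random):
    """Full jitter (not in the text): pick a random delay in [0, delay]."""
    return rng.uniform(0, delay)


def is_dead(last_heartbeat_at, now, timeout=90.0):
    """Server-side check: no heartbeat for more than `timeout` seconds."""
    return now - last_heartbeat_at > timeout


print(list(backoff_delays()))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```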

Message Routing — Delivering Messages to the Right Person

1:1 Chat Flow

When user A sends a message to user B:

WS Gateway receives the message from A, assigns a messageId (Snowflake ID — guaranteed unique + time-sortable) and timestamp
Message Service validates and enqueues to Kafka topic messages
Consumer persists to database (write-ahead), sends ACK back to sender
Router queries Redis to find user B's WebSocket server
If B is online: push via WebSocket → B sends delivery receipt → status updated to delivered
If B is offline: message stays in inbox, triggers push notification via FCM/APNs
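
The steps above can be condensed into a toy router. This is a sketch only: `Router` and its fields are hypothetical in-memory stand-ins for Redis, the message database, the WebSocket gateways, and FCM/APNs, and Kafka is omitted entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """In-memory stand-ins (hypothetical) for Redis, the message DB, and push."""
    sessions: dict = field(default_factory=dict)   # userId -> wsServerId
    stored: list = field(default_factory=list)     # persisted messages
    delivered: list = field(default_factory=list)  # (wsServerId, message)
    pushed: list = field(default_factory=list)     # offline push notifications

    def route(self, message: dict) -> str:
        """Persist, then deliver; return the delivery path that was used."""
        # Persist first, so a crash after this point cannot lose the message.
        self.stored.append(message)
        server = self.sessions.get(message["to"])
        if server is not None:
            # Recipient is online: hand off to the gateway holding the socket.
            self.delivered.append((server, message))
            return "websocket"
        # Recipient is offline: the message waits in the inbox, notify via push.
        self.pushed.append(message)
        return "push_notification"


router = Router(sessions={"user_B": "ws-server-7"})
print(router.route({"id": 1, "to": "user_B", "text": "Hello!"}))  # websocket
print(router.route({"id": 2, "to": "user_C", "text": "Hi"}))      # push_notification
```

Persisting before delivery is what makes at-least-once delivery possible: whichever path fails afterwards, the message can still be retried or pulled from storage.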

Message Ordering Challenge

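In distributed systems, messages can arrive out of order. Solution: use Snowflake IDs or Lamport timestamps, and have clients sort messages by ID instead of by the time they were received. Snowflake IDs generated on different gateways are only ordered up to the clock skew between those machines, so Kafka is also partitioned by conversationId: all messages of a conversation land in one partition, and Kafka preserves order within a partition.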
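
On the client side, sorting by ID and dropping duplicates (needed because delivery is at-least-once) can share one small buffer. `ConversationView` is a hypothetical name for illustration, not part of any particular client SDK.

```python
import bisect


class ConversationView:
    """Client-side buffer: dedups by message ID and keeps messages sorted."""

    def __init__(self):
        self._ids = []    # sorted message IDs
        self._by_id = {}  # message ID -> payload

    def receive(self, message_id: int, payload: str) -> bool:
        """Insert a message; return False if it is a duplicate delivery."""
        if message_id in self._by_id:
            return False
        bisect.insort(self._ids, message_id)
        self._by_id[message_id] = payload
        return True

    def messages(self):
        """Payloads in ID order, regardless of arrival order."""
        return [self._by_id[i] for i in self._ids]


view = ConversationView()
view.receive(3, "third")
view.receive(1, "first")
view.receive(3, "third")   # redelivery, ignored
view.receive(2, "second")
print(view.messages())     # ['first', 'second', 'third']
```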

Group Chat — The Fan-out Problem

Group chat is the biggest scaling challenge. When one user sends a message to a group of 10,000 members, the system must fan the message out to every member. If each member gets their own copy, a single send becomes 10,000 writes; this is called write amplification.

| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| Fan-out on Write | Copy message to each member's inbox on send | Fast reads, simple client logic | High write amplification for large groups |
| Fan-out on Read | Store once, members query when opening chat | Efficient writes, saves storage | Slow reads for large groups, more complex |
| Hybrid | Write fan-out for small groups (<500), Read for large groups | Best balance | More complex logic |

graph LR
    subgraph Write["Fan-out on Write"]
        A1[User sends msg] --> B1[Message Service]
        B1 --> C1[Copy → User 1 Inbox]
        B1 --> D1[Copy → User 2 Inbox]
        B1 --> E1[Copy → User N Inbox]
    end

    subgraph Read["Fan-out on Read"]
        A2[User sends msg] --> B2[Message Service]
        B2 --> C2[Store 1 copy<br/>in Group Store]
        D2[User 1 opens chat] --> C2
        E2[User 2 opens chat] --> C2
    end

    style B1 fill:#e94560,stroke:#fff,color:#fff
    style B2 fill:#e94560,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff

Comparing two fan-out strategies for group chat

WhatsApp uses fan-out on write since groups max out at 1,024 members — write amplification is acceptable. Telegram uses a hybrid approach: fan-out on write for small groups, fan-out on read for supergroups with up to 200,000 members.
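
A hybrid strategy comes down to a size check on the write path and a matching check on the read path. The sketch below uses in-memory dictionaries as stand-ins for the inbox and group stores. The 500-member threshold comes from the comparison table; the function names are hypothetical, and this is not how WhatsApp or Telegram actually implement it.

```python
from collections import defaultdict

FANOUT_WRITE_THRESHOLD = 500  # from the comparison table; tune per workload

inboxes = defaultdict(list)      # userId -> messages (fan-out on write)
group_store = defaultdict(list)  # groupId -> messages (fan-out on read)


def send_to_group(group_id, members, message):
    """Pick the fan-out strategy by group size; return the number of writes."""
    if len(members) < FANOUT_WRITE_THRESHOLD:
        for user_id in members:              # one copy per member
            inboxes[user_id].append(message)
        return len(members)
    group_store[group_id].append(message)    # single shared copy
    return 1


def read_group(group_id, user_id, members):
    """Read path: own inbox for small groups, shared log for large ones."""
    if len(members) < FANOUT_WRITE_THRESHOLD:
        return [m for m in inboxes[user_id] if m["group"] == group_id]
    return group_store[group_id]
```

With this split, a message to a 10,000-member group costs one write instead of 10,000, at the price of every reader hitting the shared group log.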

Database Design — Storing Billions of Messages

Choosing the Right Database

Chat messages have distinct characteristics: write-heavy (continuous writes), recent-read-heavy (mostly reading recent messages), append-only (no edits, rare deletes), time-series-like (queried by time). These characteristics favor wide-column stores over traditional RDBMS.

| Database | Write Throughput | Read Pattern | Use Case |
| --- | --- | --- | --- |
| Cassandra / ScyllaDB | Very high (100K+ writes/s) | Partition key scan | Primary message storage |
| PostgreSQL | Medium | Flexible (index, join) | User profiles, settings |
| Redis | Extremely high (in-memory) | Key-value, sorted set | Session, presence, recent messages cache |
| S3 / R2 | High | Object key | Media files (images, videos, files) |

Cassandra Schema Design

-- Message table: partition by conversation, cluster by time
CREATE TABLE messages (
    conversation_id UUID,
    message_id BIGINT,       -- Snowflake ID (time-sortable)
    sender_id UUID,
    message_type TEXT,        -- 'text', 'image', 'video', 'file'
    content TEXT,
    media_url TEXT,
    status TEXT,              -- 'sent', 'delivered', 'read'
    created_at TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- User conversation inbox
CREATE TABLE user_conversations (
    user_id UUID,
    last_message_at TIMESTAMP,
    conversation_id UUID,
    conversation_type TEXT,   -- '1:1' or 'group'
    last_message_preview TEXT,
    unread_count INT,
    PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);

Why Snowflake ID Instead of UUID?

Snowflake ID (64-bit) consists of: 1 unused sign bit + 41-bit millisecond timestamp + 10-bit machine ID + 12-bit sequence number. Advantages: time-sortable (no additional timestamp column needed for sorting), smaller than UUID (8 bytes vs 16 bytes), and extremely fast generation without database roundtrips. Discord, Twitter, and Instagram all use variants of Snowflake ID.
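
A minimal generator for this 41/10/12 layout might look like the sketch below. The custom epoch value and the class name are assumptions for illustration; production variants differ in epoch, bit allocation, and how they react to clock drift.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000  # custom epoch in Unix ms (assumption)


class Snowflake:
    """41-bit ms timestamp | 10-bit machine ID | 12-bit sequence."""

    def __init__(self, machine_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= machine_id < 1024:  # must fit in 10 bits
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._clock = clock
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & 0xFFF  # 12-bit sequence
                if self._seq == 0:                   # exhausted: wait for next ms
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self._machine_id << 12) | self._seq


def timestamp_ms(snowflake_id: int) -> int:
    """Recover the creation time (Unix ms) embedded in an ID."""
    return (snowflake_id >> 22) + EPOCH_MS
```

IDs from a single generator are strictly increasing. IDs from different machines are ordered by timestamp first, so they sort correctly only as far as the machines' clocks agree.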

Hot/Cold Storage Tiering

Not all messages are accessed frequently. Tiering strategy:

Hot (0-7 days): Redis sorted set — ultra-fast queries, store last 50 messages per conversation
Warm (7-90 days): Cassandra/ScyllaDB — primary storage, SSD-backed
Cold (90+ days): Object storage (S3) as compressed Parquet — minimal cost, queryable via Spark/Presto when needed
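
The age boundaries above translate directly into a lookup function. This is a sketch with a hypothetical function name; it only classifies by age, whereas the hot tier described above additionally keeps just the last 50 messages per conversation.

```python
from datetime import datetime, timedelta, timezone


def storage_tier(created_at, now=None):
    """Map a message's age to the tier that should serve it."""
    now = now or datetime.now(timezone.utc)
    age = now - created_at
    if age < timedelta(days=7):
        return "hot"    # Redis sorted set
    if age < timedelta(days=90):
        return "warm"   # Cassandra/ScyllaDB
    return "cold"       # Parquet files in object storage
```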

Presence System — "Online", "Last seen", "Typing..."

The presence system tracks real-time user status: online, offline, last seen, typing. This seems simple but is very hard to scale because each status change must be broadcast to all of that user's contacts.

stateDiagram-v2
    [*] --> Online: WebSocket connected
    Online --> Away: Inactive > 5 minutes
    Away --> Online: User interaction
    Online --> Offline: Disconnect / Heartbeat timeout
    Away --> Offline: Heartbeat timeout
    Offline --> Online: Reconnect

    Online --> Typing: Start typing
    Typing --> Online: Stop typing > 3s
    Typing --> Online: Send message

Presence System state machine

Scaling Presence

Naive approach: broadcast to all friends whenever a user changes status. If a user has 500 online friends, each change costs 500 WebSocket pushes; with 1 million status changes per minute, that is 500 million pushes per minute, which is not feasible.

Smart approach — Lazy Presence:

Store status in Redis with TTL (key: presence:{userId}, value: online, TTL: 60s)
Heartbeat every 30s refreshes TTL — when TTL expires, automatically transitions to offline
Only query presence when a user opens a conversation, don't broadcast to entire contact list
Subscribe to presence changes of visible contacts via Redis Pub/Sub channels
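
The TTL mechanics can be illustrated without a Redis server. `PresenceStore` below is a hypothetical in-memory stand-in: in Redis the heartbeat would be a `SET presence:{userId} online EX 60`, and expiry would happen inside Redis rather than in application code. The Pub/Sub subscription step is not modelled.

```python
import time


class PresenceStore:
    """In-memory stand-in for Redis keys with a TTL (hypothetical class)."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._expires_at = {}  # userId -> expiry time

    def heartbeat(self, user_id):
        """Called roughly every 30s; refreshes the key's TTL."""
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        """Pull-based check, done only when a conversation is opened."""
        expiry = self._expires_at.get(user_id)
        return expiry is not None and expiry > self._clock()
```

Because the TTL (60s) is twice the heartbeat interval (30s), a single lost heartbeat does not flip a user to offline.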

WhatsApp Presence: "Last seen"

WhatsApp doesn't continuously broadcast online status. Instead, when user A opens a chat with B, the app queries B's "last seen". This is a pull-based design that saves significant bandwidth compared to push-based approaches. The "Typing..." indicator is only sent to users currently viewing that conversation.

Push Notifications for Offline Users

When the receiver is offline, the message is persisted to the database and triggers a push notification via FCM (Android) or APNs (iOS). Key design considerations:

Batching: Combine multiple messages into a single notification if they arrive within a short window ("You have 5 new messages from Anh Tu")
Priority: 1:1 messages → high priority (display immediately), group chat → normal priority (may be delayed by OS)
Unread sync: When the user reopens the app, Sync Service pulls all unread messages from the database, not relying on push notifications
Deduplication: Client uses messageId to dedup — if already received via WebSocket, ignore the notification
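
Batching can be sketched as grouping each sender's messages into bursts. The 10-second window, the function name, and the notification wording for a single message are assumptions; the text only says "a short window".

```python
from collections import defaultdict


def batch_notifications(messages, window_s=10.0):
    """Group (sender, timestamp) events into one notification per sender burst."""
    by_sender = defaultdict(list)
    for sender, ts in sorted(messages, key=lambda m: m[1]):
        bursts = by_sender[sender]
        if bursts and ts - bursts[-1][-1] <= window_s:
            bursts[-1].append(ts)   # within the window: extend current burst
        else:
            bursts.append([ts])     # gap too large: start a new burst
    notifications = []
    for sender, bursts in by_sender.items():
        for burst in bursts:
            count = len(burst)
            if count == 1:
                notifications.append(f"New message from {sender}")
            else:
                notifications.append(f"You have {count} new messages from {sender}")
    return notifications
```

For example, three messages from one sender at t=0s, 2s, and 4s collapse into a single "You have 3 new messages" notification, while a fourth message at t=100s produces its own.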

End-to-End Encryption (E2EE)

E2EE ensures only the sender and receiver can read message content — the server only sees ciphertext. WhatsApp and Signal use the Signal Protocol (Double Ratchet Algorithm + X3DH key agreement).

sequenceDiagram
    participant A as Alice
    participant S as Server
    participant B as Bob

    Note over A,B: Key Exchange (X3DH)
    A->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys
    B->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys

    A->>S: Request Bob's Pre-Key Bundle
    S-->>A: Bob's keys (Identity + Signed Pre + One-time)

    Note over A: Compute Shared Secret via<br/>X3DH key agreement

    A->>S: Encrypted message (ciphertext)
    S->>B: Forward ciphertext
    Note over B: Compute Shared Secret<br/>→ Decrypt message

    Note over A,B: Double Ratchet: each message<br/>uses a different key

End-to-End Encryption flow with Signal Protocol

Why Double Ratchet?

Forward secrecy: if the current key is compromised, attackers still cannot decrypt past messages, because each message key is derived through a one-way KDF chain and deleted after use. Post-compromise security: after a key is compromised, the session self-heals once the parties exchange new Diffie-Hellman ratchet keys, which happens as soon as a reply comes back.
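
The "one key per message" property comes from the symmetric half of the Double Ratchet, which can be sketched with the standard library. This is an illustration only: it derives keys but does not encrypt anything, it omits the Diffie-Hellman ratchet entirely, and the initial chain key is a placeholder rather than a real X3DH output. It should not be used as a substitute for a vetted Signal Protocol library.

```python
import hashlib
import hmac


def kdf_chain_step(chain_key: bytes):
    """One symmetric-ratchet step: return (next_chain_key, message_key).

    Both outputs are HMAC-SHA256 of the current chain key with distinct
    constant inputs, so neither reveals the other or the previous chain key.
    """
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


# Placeholder for the shared secret that X3DH would produce.
chain_key = hashlib.sha256(b"placeholder shared secret").digest()

message_keys = []
for _ in range(3):
    chain_key, message_key = kdf_chain_step(chain_key)
    message_keys.append(message_key)  # a real client deletes each key after use
```

Since HMAC cannot be run backwards, an attacker who steals the current `chain_key` cannot recompute earlier message keys, which is the forward-secrecy half of the argument above.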

Scaling Strategies

WebSocket Gateway Scaling

The WebSocket gateway is stateful (holds connections), so it can't be scaled as simply as stateless HTTP services. Strategies:

Horizontal scaling: Add servers, use consistent hashing to distribute users. When adding/removing servers, only ~1/N users need to reconnect
Connection draining: When shutting down a server, notify clients to reconnect gradually (graceful) instead of abrupt disconnection
Regional deployment: Deploy WebSocket gateways across multiple regions, route users to the nearest server using GeoDNS

Message Database Sharding

graph LR
    subgraph Shard["Sharding by conversation_id"]
        M[Message] --> H{Hash<br/>conversation_id}
        H -->|hash % 4 = 0| S0[Shard 0]
        H -->|hash % 4 = 1| S1[Shard 1]
        H -->|hash % 4 = 2| S2[Shard 2]
        H -->|hash % 4 = 3| S3[Shard 3]
    end

    style H fill:#e94560,stroke:#fff,color:#fff
    style S0 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

Message database sharding by conversation_id

The shard key is conversation_id rather than user_id because: all messages in the same conversation reside on 1 shard → querying "get the latest 50 messages of conversation X" only hits 1 shard, no scatter-gather needed.
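
The `hash % 4` rule from the diagram can be written as a small function. The function name and the choice of SHA-256 are assumptions; the point is to use a stable hash, since Python's built-in `hash()` varies between processes for some types.

```python
import hashlib
import uuid

NUM_SHARDS = 4  # matches the diagram


def shard_for(conversation_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Map a conversation to a shard with a stable hash of its ID."""
    digest = hashlib.sha256(conversation_id.bytes).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


conversation = uuid.uuid4()
# Every message in the conversation resolves to the same shard.
assert shard_for(conversation) == shard_for(conversation)
```

Plain modulo sharding remaps most conversations when the shard count changes. Cassandra and ScyllaDB avoid this by placing partition keys on a token ring, so in practice the database handles this step itself.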

Recommended Tech Stack for Production

| Component | Technology | Rationale |
| --- | --- | --- |
| WebSocket Gateway | .NET 10 (Kestrel) / Go | Kestrel handles hundreds of thousands of concurrent connections with low memory footprint |
| Message Queue | Apache Kafka | High throughput, partition-based ordering, replay capability |
| Message Store | ScyllaDB (Cassandra-compatible) | Extremely high write throughput, linear scalability, familiar CQL |
| Session/Presence Cache | Redis Cluster | Sub-millisecond latency, Pub/Sub for presence, automatic TTL cleanup |
| User/Group Metadata | PostgreSQL | ACID compliance for user profiles, group settings, permissions |
| Media Storage | S3 / Cloudflare R2 | S3-compatible, R2 has zero egress fees, ideal for chat images/videos |
| Search | Elasticsearch / Meilisearch | Full-text search across message history |
| Push Notifications | FCM + APNs | Platform-native, reliable, free |

Monitoring & Observability

For real-time systems, monitoring is mission-critical. Key metrics to track:

Message delivery latency (P50, P95, P99): Time from sender sending to receiver receiving — target P99 < 200ms
WebSocket connection count per server: Alert when exceeding 80% capacity
Message queue lag: Consumer lag on Kafka — continuously increasing lag signals a bottleneck
Failed delivery rate: Percentage of messages that fail to deliver after 3 retries
Database write latency: P99 write latency of Cassandra — target < 10ms
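
The P50/P95/P99 figures are percentiles over latency samples. The sketch below uses the nearest-rank definition on synthetic data. The function name is hypothetical, and a production system would normally rely on histogram-based estimates from its metrics backend instead of sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic samples: 1..100 ms
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
print(p50, p95, p99)  # 50 95 99
```

An average would hide the slow tail here; the P99 target of 200ms means that at most 1 in 100 messages may take longer than that.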

Distributed Tracing for Message Flow

Attach a traceId to each message from the moment the sender sends it, propagating through WebSocket Gateway → Kafka → Consumer → Database → Receiver. Use OpenTelemetry + Jaeger/Tempo to trace the entire message journey — when a user reports "slow message delivery", you can pinpoint exactly where the bottleneck is.

Real-world Lessons from WhatsApp and Telegram

WhatsApp — Erlang/FreeBSD

WhatsApp is famous for its tiny team: in 2015, roughly 50 engineers served about 900 million users. The secret: the Erlang VM (BEAM), designed for telecom, with millions of lightweight processes, hot code reloading, and built-in fault tolerance. WhatsApp has reported individual servers handling 2-3 million connections.

Telegram — MTProto + Bare-metal

Telegram developed the MTProto protocol instead of using standard TLS, optimized for mobile (fewer round-trips, smaller packets). Infrastructure runs on bare-metal servers, not cloud — reducing latency and costs at large scale.

Discord — Elixir + Rust

Discord started with Elixir (BEAM VM, like Erlang) for the WebSocket gateway, then rewrote hot paths in Rust to optimize memory. A single Discord server handles up to 1 million concurrent connections.

Slack — Java + WebSocket

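Slack's real-time messaging services are written in Java and talk to clients over WebSocket, while its web application backend is written in PHP/Hack; persistence is sharded MySQL. As it scaled, Slack hit the limits of its original sharding scheme and gradually migrated to Vitess, a sharding and clustering layer for MySQL.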

Conclusion

Designing a large-scale real-time Chat system requires careful consideration at multiple layers: from WebSocket connection management with consistent hashing and heartbeat, message routing with appropriate fan-out strategies, database design optimized for write-heavy workloads, to bandwidth-efficient presence systems and privacy-preserving E2E encryption.

There's no "one-size-fits-all" solution: WhatsApp chose Erlang for connection handling and fan-out on write because its groups are small, Telegram chose MTProto and hybrid fan-out to support large supergroups, and Discord rewrote hot paths in Rust because of memory pressure. The common thread: understand the tradeoffs in every architectural decision, design for failure (network partitions, server crashes), and measure with real metrics rather than guessing.


# System Design: Real-time Chat at Million-User Scale

## Why is Chat System Design a Hard Problem?

100B+ Messages/day on WhatsApp

<200ms Required chat latency

10M+ WebSocket connections/server

99.99% Minimum uptime SLA

## System Requirements Analysis

### Functional Requirements

- **1:1 Chat:** Send and receive real-time messages between two users
- **Group Chat:** Support groups with up to thousands of members
- **Online/Offline Status:** Display user activity status
- **Read Receipts:** Confirm message status (sent → delivered → read)
- **Push Notifications:** Notify when user is offline
- **Media Sharing:** Send photos, videos, and file attachments
- **Message History:** Store and sync message history across multiple devices

### Non-functional Requirements

- **Low Latency:** End-to-end delivery under 200ms at the 99th percentile
- **High Availability:** 99.99% uptime (maximum 52 minutes downtime/year)
- **Message Ordering:** Messages must display in correct order within a conversation
- **At-least-once Delivery:** No message loss, accept duplicates (client-side dedup)
- **Scalability:** Horizontal scaling to 50M+ concurrent connections

#### Quick Estimation

## High-Level Architecture

```
graph TB
    subgraph Clients["📱 Clients"]
        C1[Mobile App]
        C2[Web App]
        C3[Desktop App]
    end

subgraph Edge["Edge Layer"]
        LB[Load Balancer  
Sticky Sessions]
    end

subgraph WS["WebSocket Gateway Cluster"]
        WS1[WS Server 1  
~500K connections]
        WS2[WS Server 2  
~500K connections]
        WS3[WS Server N  
~500K connections]
    end

subgraph Services["Application Services"]
        MS[Message Service]
        PS[Presence Service]
        GS[Group Service]
        NS[Notification Service]
        SYNC[Sync Service]
    end

subgraph MQ["Message Queue"]
        K[Kafka / RabbitMQ]
    end

subgraph Storage["Data Layer"]
        DB[(Message DB  
Cassandra/ScyllaDB)]
        CACHE[(Session Cache  
Redis Cluster)]
        S3[Media Storage  
S3/R2]
        IDX[(Search Index  
Elasticsearch)]
    end

C1 & C2 & C3 --> LB
    LB --> WS1 & WS2 & WS3
    WS1 & WS2 & WS3 --> MS
    MS --> K
    K --> DB
    K --> NS
    WS1 & WS2 & WS3 --> PS
    PS --> CACHE
    MS --> GS
    NS --> C1 & C2 & C3
    SYNC --> DB

style LB fill:#e94560,stroke:#fff,color:#fff
    style K fill:#2c3e50,stroke:#fff,color:#fff
    style DB fill:#16213e,stroke:#fff,color:#fff
    style CACHE fill:#e94560,stroke:#fff,color:#fff

```
High-level architecture of a large-scale real-time Chat system

The architecture is divided into 4 main layers: **Edge Layer** handles connections and load balancing; **WebSocket Gateway** maintains persistent connections with clients; **Application Services** handle business logic; and **Data Layer** stores messages, sessions, and media.

## WebSocket Gateway — The Heart of the System

### Connection Management

Each WebSocket server can manage 500K–1M concurrent connections using event-driven I/O (epoll on Linux). With 50 million concurrent connections, approximately 50–100 WebSocket servers are needed.

```
sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant WS as WS Gateway
    participant R as Redis (Session)
    participant MS as Message Service

C->>LB: WebSocket Handshake
    LB->>WS: Route (Sticky by User ID)
    WS->>R: Register session  
{userId: ws-server-3, connId: abc123}
    R-->>WS: OK
    WS-->>C: Connection Established

Note over C,WS: Heartbeat every 30s to keep connection alive

C->>WS: Send Message {to: user_B, text: "Hello!"}
    WS->>MS: Route message
    MS->>R: Lookup user_B session
    R-->>MS: {server: ws-server-7, connId: xyz789}
    MS->>WS: Deliver to ws-server-7
    WS->>C: Message delivered ✓

```
1:1 message sending flow through WebSocket Gateway

#### Sticky Sessions vs Stateless Routing

The Load Balancer uses **consistent hashing** by User ID to route WebSocket connections to the same server. When a server crashes, only users on that server are affected and automatically reconnect to a new server. Redis stores the `userId → wsServerId` mapping so other services know where to deliver messages.

### Heartbeat & Reconnection

## Message Routing — Delivering Messages to the Right Person

### 1:1 Chat Flow

When user A sends a message to user B:

1. **WS Gateway** receives the message from A, assigns a `messageId` (Snowflake ID — guaranteed unique + time-sortable) and timestamp
2. **Message Service** validates and enqueues to Kafka topic `messages`
3. **Consumer** persists to database (write-ahead), sends ACK back to sender
4. **Router** queries Redis to find user B's WebSocket server
5. If B is **online**: push via WebSocket → B sends delivery receipt → status updated to `delivered`
6. If B is **offline**: message stays in inbox, triggers push notification via FCM/APNs

#### Message Ordering Challenge

In distributed systems, messages can arrive out of order. Solution: use **Snowflake IDs** or **Lamport timestamps** — clients sort messages by ID instead of server-received time. Kafka partitions by `conversationId` to ensure ordering within the same conversation.

### Group Chat — The Fan-out Problem

Group chat is the biggest scaling challenge. When 1 user sends a message to a group of 10,000 members, the system must fan-out the message to everyone — this is called **write amplification**.

| Strategy | Description | Pros | Cons |
| --- | --- | --- | --- |
| **Fan-out on Write** | Copy message to each member's inbox on send | Fast reads, simple client logic | High write amplification for large groups |
| **Fan-out on Read** | Store once, members query when opening chat | Efficient writes, saves storage | Slow reads for large groups, more complex |
| **Hybrid** | Write fan-out for small groups (<500), Read for large groups | Best balance | More complex logic |

```
graph LR
    subgraph Write["Fan-out on Write"]
        A1[User sends msg] --> B1[Message Service]
        B1 --> C1[Copy → User 1 Inbox]
        B1 --> D1[Copy → User 2 Inbox]
        B1 --> E1[Copy → User N Inbox]
    end

subgraph Read["Fan-out on Read"]
        A2[User sends msg] --> B2[Message Service]
        B2 --> C2[Store 1 copy  
in Group Store]
        D2[User 1 opens chat] --> C2
        E2[User 2 opens chat] --> C2
    end

style B1 fill:#e94560,stroke:#fff,color:#fff
    style B2 fill:#e94560,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff

```
Comparing two fan-out strategies for group chat

WhatsApp uses **fan-out on write** since groups max out at 1,024 members — write amplification is acceptable. Telegram uses a **hybrid** approach: fan-out on write for small groups, fan-out on read for supergroups with up to 200,000 members.

## Database Design — Storing Billions of Messages

### Choosing the Right Database

Chat messages have distinct characteristics: **write-heavy** (continuous writes), **recent-read-heavy** (mostly reading recent messages), **append-only** (no edits, rare deletes), **time-series-like** (queried by time). These characteristics favor wide-column stores over traditional RDBMS.

| Database | Write Throughput | Read Pattern | Use Case |
| --- | --- | --- | --- |
| **Cassandra / ScyllaDB** | Very high (100K+ writes/s) | Partition key scan | Primary message storage |
| **PostgreSQL** | Medium | Flexible (index, join) | User profiles, settings |
| **Redis** | Extremely high (in-memory) | Key-value, sorted set | Session, presence, recent messages cache |
| **S3 / R2** | High | Object key | Media files (images, videos, files) |

### Cassandra Schema Design

```
-- Message table: partition by conversation, cluster by time
CREATE TABLE messages (
    conversation_id UUID,
    message_id BIGINT,       -- Snowflake ID (time-sortable)
    sender_id UUID,
    message_type TEXT,        -- 'text', 'image', 'video', 'file'
    content TEXT,
    media_url TEXT,
    status TEXT,              -- 'sent', 'delivered', 'read'
    created_at TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

-- User conversation inbox
CREATE TABLE user_conversations (
    user_id UUID,
    last_message_at TIMESTAMP,
    conversation_id UUID,
    conversation_type TEXT,   -- '1:1' or 'group'
    last_message_preview TEXT,
    unread_count INT,
    PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);
```

#### Why Snowflake ID Instead of UUID?

Snowflake ID (64-bit) consists of: 41 bits timestamp + 10 bits machine ID + 12 bits sequence number. Advantages: **time-sortable** (no additional timestamp column needed for sorting), smaller than UUID (8 bytes vs 16 bytes), and extremely fast generation without database roundtrips. Discord, Twitter, and Instagram all use variants of Snowflake ID.

### Hot/Cold Storage Tiering

Not all messages are accessed frequently. Tiering strategy:

- **Hot (0-7 days):** Redis sorted set — ultra-fast queries, store last 50 messages per conversation
- **Warm (7-90 days):** Cassandra/ScyllaDB — primary storage, SSD-backed
- **Cold (90+ days):** Object storage (S3) as compressed Parquet — minimal cost, queryable via Spark/Presto when needed

## Presence System — "Online", "Last seen", "Typing..."

```
stateDiagram-v2
    [*] --> Online: WebSocket connected
    Online --> Away: Inactive > 5 minutes
    Away --> Online: User interaction
    Online --> Offline: Disconnect / Heartbeat timeout
    Away --> Offline: Heartbeat timeout
    Offline --> Online: Reconnect

Online --> Typing: Start typing
    Typing --> Online: Stop typing > 3s
    Typing --> Online: Send message

```
Presence System state machine

### Scaling Presence

Smart approach — **Lazy Presence**:

1. Store status in **Redis** with TTL (key: `presence:{userId}`, value: `online`, TTL: 60s)
2. Heartbeat every 30s **refreshes TTL** — when TTL expires, automatically transitions to offline
3. Only query presence **when a user opens a conversation**, don't broadcast to entire contact list
4. Subscribe to presence changes of **visible contacts** via Redis Pub/Sub channels

#### WhatsApp Presence: "Last seen"

## Push Notifications for Offline Users

When the receiver is offline, the message is persisted to the database and triggers a push notification via FCM (Android) or APNs (iOS). Key design considerations:

- **Batching:** Combine multiple messages into a single notification if they arrive within a short window ("You have 5 new messages from Anh Tu")
- **Priority:** 1:1 messages → high priority (display immediately), group chat → normal priority (may be delayed by OS)
- **Unread sync:** When the user reopens the app, Sync Service pulls all unread messages from the database, not relying on push notifications
- **Deduplication:** Client uses `messageId` to dedup — if already received via WebSocket, ignore the notification

## End-to-End Encryption (E2EE)

E2EE ensures only the sender and receiver can read message content — the server only sees ciphertext. WhatsApp and Signal use the **Signal Protocol** (Double Ratchet Algorithm + X3DH key agreement).

```
sequenceDiagram
    participant A as Alice
    participant S as Server
    participant B as Bob

Note over A,B: Key Exchange (X3DH)
    A->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys
    B->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys

A->>S: Request Bob's Pre-Key Bundle
    S-->>A: Bob's keys (Identity + Signed Pre + One-time)

Note over A: Compute Shared Secret via  
X3DH key agreement

A->>S: Encrypted message (ciphertext)
    S->>B: Forward ciphertext
    Note over B: Compute Shared Secret  
→ Decrypt message

Note over A,B: Double Ratchet: each message  
uses a different key

```
End-to-End Encryption flow with Signal Protocol

#### Why Double Ratchet?

**Forward secrecy:** If the current key is compromised, attackers cannot decrypt past messages because each message uses its own key derived from the ratchet chain. **Post-compromise security:** After a key is compromised, the system self-heals by generating new keys through the DH ratchet on the next message.

## Scaling Strategies

### WebSocket Gateway Scaling

The WebSocket gateway is stateful (holds connections), so it can't be scaled as simply as stateless HTTP services. Strategies:

- **Horizontal scaling:** Add servers, use consistent hashing to distribute users. When adding/removing servers, only ~1/N users need to reconnect
- **Connection draining:** When shutting down a server, notify clients to reconnect gradually (graceful) instead of abrupt disconnection
- **Regional deployment:** Deploy WebSocket gateways across multiple regions, route users to the nearest server using GeoDNS
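
The consistent-hashing idea from the first bullet can be sketched as a hash ring with virtual nodes. This is a simplified illustration under assumptions: the `HashRing` class, the SHA-256-based hash, and the default of 100 virtual nodes per server are choices made for the example, not details from any particular gateway.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies per process)."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Maps user IDs to gateway servers using consistent hashing."""

    def __init__(self, servers, vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: list[tuple[int, str]] = []  # sorted (position, server)
        for server in servers:
            self.add(server)

    def add(self, server: str) -> None:
        # Each server occupies many ring positions to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{server}#{i}"), server))

    def remove(self, server: str) -> None:
        self._ring = [(pos, s) for pos, s in self._ring if s != server]

    def lookup(self, user_id: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First ring position clockwise from the user's hash, wrapping around.
        idx = bisect.bisect(self._ring, (_hash(user_id), "")) % len(self._ring)
        return self._ring[idx][1]
```

When a fifth server joins a four-server ring, the only users whose assignment changes are the ones that now map to the new server; everyone else keeps their existing connection.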

### Message Database Sharding

```
graph LR
    subgraph Shard["Sharding by conversation_id"]
        M[Message] --> H{Hash<br/>conversation_id}
        H -->|hash % 4 = 0| S0[Shard 0]
        H -->|hash % 4 = 1| S1[Shard 1]
        H -->|hash % 4 = 2| S2[Shard 2]
        H -->|hash % 4 = 3| S3[Shard 3]
    end

    style H fill:#e94560,stroke:#fff,color:#fff
    style S0 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
```
Message database sharding by conversation_id

The shard key is `conversation_id` rather than `user_id` because all messages in the same conversation then live on one shard: a query such as "get the latest 50 messages of conversation X" hits a single shard, with no scatter-gather needed.
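
As a rough sketch, the `hash % 4` routing in the diagram could look like this. The function name `shard_for` and the choice of SHA-256 are assumptions for the example; the important property is that the hash is stable across processes, which Python's built-in `hash()` is not for strings.

```python
import hashlib

NUM_SHARDS = 4  # matches the four shards in the diagram


def shard_for(conversation_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Return the shard index that stores every message of a conversation."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo shards.
    return int.from_bytes(digest[:8], "big") % num_shards
```

One design caveat: with a plain modulo, changing `num_shards` remaps most conversations, so the shard count is usually chosen generously up front or replaced by a scheme that moves less data.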

## Recommended Tech Stack for Production

| Component | Technology | Rationale |
| --- | --- | --- |
| WebSocket Gateway | .NET 10 (Kestrel) / Go | Kestrel handles hundreds of thousands of concurrent connections with low memory footprint |
| Message Queue | Apache Kafka | High throughput, partition-based ordering, replay capability |
| Message Store | ScyllaDB (Cassandra-compatible) | Extremely high write throughput, linear scalability, familiar CQL |
| Session/Presence Cache | Redis Cluster | Sub-millisecond latency, Pub/Sub for presence, automatic TTL cleanup |
| User/Group Metadata | PostgreSQL | ACID compliance for user profiles, group settings, permissions |
| Media Storage | S3 / Cloudflare R2 | S3-compatible, R2 has zero egress fees — ideal for chat images/videos |
| Search | Elasticsearch / Meilisearch | Full-text search across message history |
| Push Notifications | FCM + APNs | Platform-native, reliable, free |

## Monitoring & Observability

For real-time systems, monitoring is mission-critical. Key metrics to track:

- **Message delivery latency (P50, P95, P99):** Time from sender sending to receiver receiving — target P99 < 200ms
- **WebSocket connection count per server:** Alert when exceeding 80% capacity
- **Message queue lag:** Consumer lag on Kafka — continuously increasing lag signals a bottleneck
- **Failed delivery rate:** Percentage of messages that fail to deliver after 3 retries
- **Database write latency:** P99 write latency of the message store (ScyllaDB/Cassandra), target < 10ms
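
For the latency metrics above, a small sketch of how P50/P95/P99 might be computed from raw samples. It assumes the nearest-rank percentile definition and hypothetical helper names; a production setup would normally rely on histogram metrics from the monitoring stack rather than sorting raw samples.

```python
import math


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms: list[float], p99_target_ms: float = 200.0) -> dict:
    """Summarize delivery latency and check it against the P99 target."""
    report = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
    report["p99_within_target"] = report["p99"] < p99_target_ms
    return report
```

Tracking P99 rather than the average matters because a handful of very slow deliveries can hide behind a healthy-looking mean.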

#### Distributed Tracing for Message Flow

Attach a `traceId` to each message from the moment the sender sends it, propagating through WebSocket Gateway → Kafka → Consumer → Database → Receiver. Use OpenTelemetry + Jaeger/Tempo to trace the entire message journey — when a user reports "slow message delivery", you can pinpoint exactly where the bottleneck is.
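
The propagation idea can be illustrated with a plain-Python stand-in. This sketch does not use the OpenTelemetry SDK; `TracedMessage`, `mark`, and `slowest_hop` are hypothetical names, and the stage labels simply mirror the pipeline described above.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TracedMessage:
    """A message envelope that carries one trace ID through every stage."""

    message_id: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    hops: list = field(default_factory=list)  # (stage, timestamp_ms) pairs

    def mark(self, stage: str, now_ms: float) -> None:
        # Each component records when the message passed through it.
        self.hops.append((stage, now_ms))


def slowest_hop(msg: TracedMessage) -> tuple[str, str, float]:
    """Return (from_stage, to_stage, elapsed_ms) for the slowest transition."""
    if len(msg.hops) < 2:
        raise ValueError("need at least two recorded hops")
    gaps = [
        (prev[0], cur[0], cur[1] - prev[1])
        for prev, cur in zip(msg.hops, msg.hops[1:])
    ]
    return max(gaps, key=lambda gap: gap[2])
```

With hops recorded at the gateway, Kafka, consumer, database, and receiver, a call to `slowest_hop` points at the transition where the message spent the most time, which is the question a "slow delivery" report needs answered.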

## Real-world Lessons from WhatsApp and Telegram

- WhatsApp: Erlang/FreeBSD
- Telegram: MTProto + Bare-metal
- Discord: Elixir + Rust
- Slack: Java + WebSocket

## Conclusion

There's no "one-size-fits-all" solution — WhatsApp chose Erlang and fan-out on write because of small groups, Telegram chose MTProto and hybrid fan-out for large supergroups, Discord rewrote in Rust due to memory pressure. The common thread: **understanding the tradeoffs in every architectural decision**, designing for failure (network partitions, server crashes), and always measuring with real metrics rather than guessing.


