System Design: Real-time Chat at Million-User Scale

Posted on: 4/21/2026 8:15:59 PM

Why is Chat System Design a Hard Problem?

Every day, WhatsApp processes over 100 billion messages, and Telegram serves 900 million active users. Behind the simple interface (type, send, receive) lies a complex distributed architecture that must simultaneously deliver messages in real time with under 200ms latency, guarantee zero message loss, scale to millions of concurrent connections, and support group chats with up to 200,000 members.

This article dives deep into the architecture of a large-scale real-time chat system from a practical, production-oriented system design perspective. It covers WebSocket connection management, message routing, database design, presence systems, and end-to-end encryption.

100B+ Messages/day on WhatsApp
<200ms Required chat latency
500K–1M WebSocket connections/server
99.99% Minimum uptime SLA

System Requirements Analysis

Functional Requirements

  • 1:1 Chat: Send and receive real-time messages between two users
  • Group Chat: Support groups with up to thousands of members
  • Online/Offline Status: Display user activity status
  • Read Receipts: Confirm message status (sent → delivered → read)
  • Push Notifications: Notify when user is offline
  • Media Sharing: Send photos, videos, and file attachments
  • Message History: Store and sync message history across multiple devices

Non-functional Requirements

  • Low Latency: End-to-end delivery under 200ms at the 99th percentile
  • High Availability: 99.99% uptime (at most about 52.6 minutes of downtime per year)
  • Message Ordering: Messages must display in correct order within a conversation
  • At-least-once Delivery: No message loss, accept duplicates (client-side dedup)
  • Scalability: Horizontal scaling to 50M+ concurrent connections

Quick Estimation

Assuming 50 million DAU, each user sends an average of 40 messages/day → 2 billion messages/day → ~23,000 messages/second. Peak load at 3x → ~70,000 messages/second. Each message averages 200 bytes → storage ~400GB/day for raw messages, excluding metadata and indexes.
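The arithmetic above can be reproduced in a few lines. This is only a sketch of the same back-of-the-envelope estimate; the constant names are illustrative, and the inputs are the assumptions stated in the paragraph.

```python
# Inputs taken from the estimate above.
DAILY_ACTIVE_USERS = 50_000_000
MESSAGES_PER_USER_PER_DAY = 40
AVG_MESSAGE_BYTES = 200
PEAK_FACTOR = 3
SECONDS_PER_DAY = 86_400

messages_per_day = DAILY_ACTIVE_USERS * MESSAGES_PER_USER_PER_DAY
avg_messages_per_second = messages_per_day / SECONDS_PER_DAY
peak_messages_per_second = avg_messages_per_second * PEAK_FACTOR
raw_storage_gb_per_day = messages_per_day * AVG_MESSAGE_BYTES / 1e9  # decimal GB

print(f"messages/day:     {messages_per_day:,}")
print(f"average msg/s:    {avg_messages_per_second:,.0f}")
print(f"peak msg/s (3x):  {peak_messages_per_second:,.0f}")
print(f"raw storage/day:  {raw_storage_gb_per_day:,.0f} GB")
```

Changing any one input (for example, the peak factor) shows how sensitive the capacity plan is to that assumption.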

High-Level Architecture

graph TB
    subgraph Clients["📱 Clients"]
        C1[Mobile App]
        C2[Web App]
        C3[Desktop App]
    end

    subgraph Edge["Edge Layer"]
        LB[Load Balancer<br/>Sticky Sessions]
    end

    subgraph WS["WebSocket Gateway Cluster"]
        WS1[WS Server 1<br/>~500K connections]
        WS2[WS Server 2<br/>~500K connections]
        WS3[WS Server N<br/>~500K connections]
    end

    subgraph Services["Application Services"]
        MS[Message Service]
        PS[Presence Service]
        GS[Group Service]
        NS[Notification Service]
        SYNC[Sync Service]
    end

    subgraph MQ["Message Queue"]
        K[Kafka / RabbitMQ]
    end

    subgraph Storage["Data Layer"]
        DB[(Message DB<br/>Cassandra/ScyllaDB)]
        CACHE[(Session Cache<br/>Redis Cluster)]
        S3[Media Storage<br/>S3/R2]
        IDX[(Search Index<br/>Elasticsearch)]
    end

    C1 & C2 & C3 --> LB
    LB --> WS1 & WS2 & WS3
    WS1 & WS2 & WS3 --> MS
    MS --> K
    K --> DB
    K --> NS
    WS1 & WS2 & WS3 --> PS
    PS --> CACHE
    MS --> GS
    NS --> C1 & C2 & C3
    SYNC --> DB

    style LB fill:#e94560,stroke:#fff,color:#fff
    style K fill:#2c3e50,stroke:#fff,color:#fff
    style DB fill:#16213e,stroke:#fff,color:#fff
    style CACHE fill:#e94560,stroke:#fff,color:#fff

High-level architecture of a large-scale real-time Chat system

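The architecture is divided into 4 main layers: the Edge Layer handles connections and load balancing; the WebSocket Gateway maintains persistent connections with clients; Application Services handle business logic; and the Data Layer stores messages, sessions, and media. A message queue (Kafka or RabbitMQ) sits between the Message Service and the Data Layer, decoupling message ingestion from persistence and notification delivery.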

WebSocket Gateway — The Heart of the System

WebSocket is the foundational protocol for real-time chat because it maintains a continuous bidirectional connection between client and server, unlike traditional HTTP request-response. When a user opens the app, a WebSocket connection is established and kept open throughout the session.

Connection Management

Each WebSocket server can manage 500K–1M concurrent connections using event-driven I/O (epoll on Linux). With 50 million concurrent connections, approximately 50–100 WebSocket servers are needed.

sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant WS as WS Gateway
    participant R as Redis (Session)
    participant MS as Message Service

    C->>LB: WebSocket Handshake
    LB->>WS: Route (Sticky by User ID)
    WS->>R: Register session<br/>{userId: ws-server-3, connId: abc123}
    R-->>WS: OK
    WS-->>C: Connection Established

    Note over C,WS: Heartbeat every 30s to keep connection alive

    C->>WS: Send Message {to: user_B, text: "Hello!"}
    WS->>MS: Route message
    MS->>R: Lookup user_B session
    R-->>MS: {server: ws-server-7, connId: xyz789}
    MS->>WS: Deliver to ws-server-7
    WS->>C: Message delivered ✓

1:1 message sending flow through WebSocket Gateway

Sticky Sessions vs Stateless Routing

The Load Balancer uses consistent hashing by User ID to route WebSocket connections to the same server. When a server crashes, only users on that server are affected and automatically reconnect to a new server. Redis stores the userId → wsServerId mapping so other services know where to deliver messages.
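To make the routing idea concrete, here is a minimal consistent-hash ring. It is a sketch under stated assumptions, not the load balancer's actual implementation: the HashRing class, the use of SHA-256, the 100 virtual nodes per server, and the server names are all illustrative choices.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, server) points on the ring
        self._keys = []  # hashes only, kept parallel to _ring for bisect
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as a 64-bit position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")

    def _rebuild_keys(self) -> None:
        self._keys = [h for h, _ in self._ring]

    def add(self, server: str) -> None:
        for i in range(self._vnodes):
            self._ring.append((self._hash(f"{server}#{i}"), server))
        self._ring.sort()
        self._rebuild_keys()

    def remove(self, server: str) -> None:
        self._ring = [(h, s) for h, s in self._ring if s != server]
        self._rebuild_keys()

    def lookup(self, user_id: str) -> str:
        # Walk clockwise to the first point at or after the user's hash.
        idx = bisect.bisect_left(self._keys, self._hash(user_id)) % len(self._keys)
        return self._ring[idx][1]


ring = HashRing(["ws-server-1", "ws-server-2", "ws-server-3", "ws-server-4"])
before = {f"user-{i}": ring.lookup(f"user-{i}") for i in range(1000)}
ring.remove("ws-server-3")  # simulate a crashed gateway
moved = [u for u, server in before.items() if ring.lookup(u) != server]
print(f"{len(moved)} of {len(before)} users had to reconnect")
```

Only users that were mapped to the removed server change their assignment; everyone else keeps the same gateway, which is the property that makes a crash a local event rather than a global reshuffle.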

Heartbeat & Reconnection

Clients send heartbeat (ping/pong) every 30 seconds. If the server doesn't receive a heartbeat within 90 seconds, the connection is marked dead and the session is removed from Redis. Client-side retry logic uses exponential backoff: 1s → 2s → 4s → 8s → max 30s.
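The timeout check and the retry schedule can be sketched as follows. The function names are hypothetical, and the jitter helper is an addition that the text above does not mention; it is a common way to keep many clients from reconnecting at the same instant after a server crash.

```python
import random

HEARTBEAT_INTERVAL_S = 30  # client sends a ping this often
HEARTBEAT_TIMEOUT_S = 90   # server gives up after this much silence


def is_connection_dead(last_heartbeat_at: float, now: float) -> bool:
    """Server side: True once no heartbeat has arrived within the timeout."""
    return now - last_heartbeat_at > HEARTBEAT_TIMEOUT_S


def reconnect_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Client side: 1s, 2s, 4s, 8s, ... doubling up to the 30s cap."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


def with_jitter(delay: float, rng: random.Random) -> float:
    """Assumption, not from the article: keep half the delay, randomize the rest."""
    return delay / 2 + rng.uniform(0, delay / 2)


rng = random.Random()
for delay in reconnect_delays(7):
    print(f"base delay {delay:>4.0f}s, with jitter {with_jitter(delay, rng):.2f}s")
```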

Message Routing — Delivering Messages to the Right Person

1:1 Chat Flow

When user A sends a message to user B:

  1. WS Gateway receives the message from A and assigns a messageId (a Snowflake ID, which is unique and time-sortable) and a timestamp
  2. Message Service validates the message and enqueues it to the Kafka topic "messages"
  3. A consumer persists the message to the database before delivery (write-ahead), then sends an ACK back to the sender
  4. Router queries Redis to find user B's WebSocket server
  5. If B is online: push via WebSocket → B sends delivery receipt → status updated to delivered
  6. If B is offline: message stays in inbox, triggers push notification via FCM/APNs
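The steps above can be sketched as an in-memory simulation. Everything here is a stand-in: a dict plays the role of the Redis session map, lists play the roles of the database and the outbound WebSocket and push channels, and Kafka is left out entirely. The ChatBackend class and its field names are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ChatBackend:
    sessions: dict = field(default_factory=dict)     # userId -> wsServerId (stand-in for Redis)
    database: list = field(default_factory=list)     # persisted messages (stand-in for the DB)
    ws_outbox: list = field(default_factory=list)    # (wsServerId, messageId) pushed over WebSocket
    push_outbox: list = field(default_factory=list)  # (userId, messageId) sent via FCM/APNs
    _ids: count = field(default_factory=lambda: count(1))  # stand-in for Snowflake IDs

    def send(self, sender: str, recipient: str, text: str) -> dict:
        message = {"id": next(self._ids), "from": sender, "to": recipient,
                   "text": text, "status": "sent"}
        self.database.append(message)  # persist first, so a crash cannot lose the message
        server = self.sessions.get(recipient)
        if server is not None:
            self.ws_outbox.append((server, message["id"]))       # recipient is online
        else:
            self.push_outbox.append((recipient, message["id"]))  # recipient is offline
        return message

    def ack_delivery(self, message_id: int) -> None:
        """Called when the recipient's client sends a delivery receipt."""
        for message in self.database:
            if message["id"] == message_id:
                message["status"] = "delivered"


backend = ChatBackend(sessions={"user_B": "ws-server-7"})
online_msg = backend.send("user_A", "user_B", "Hello!")
offline_msg = backend.send("user_A", "user_C", "Are you there?")
backend.ack_delivery(online_msg["id"])
print(backend.ws_outbox, backend.push_outbox, online_msg["status"], offline_msg["status"])
```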

Message Ordering Challenge

In distributed systems, messages can arrive out of order. Solution: use Snowflake IDs or Lamport timestamps — clients sort messages by ID instead of server-received time. Kafka partitions by conversationId to ensure ordering within the same conversation.
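A minimal sketch of both halves of this idea follows. The partition function uses CRC32 purely for illustration; it is not Kafka's own partitioner, and the function and field names are hypothetical. The point is only that a stable hash of conversationId sends every message of a conversation to the same partition, and that clients order by ID rather than by arrival.

```python
import zlib


def partition_for(conversation_id: str, num_partitions: int) -> int:
    """Stable hash: the same conversation always maps to the same partition."""
    return zlib.crc32(conversation_id.encode("utf-8")) % num_partitions


def in_display_order(messages: list) -> list:
    """Client side: sort by time-sortable message ID, not by arrival order."""
    return sorted(messages, key=lambda m: m["message_id"])


# Messages arrive out of order over the network.
received = [
    {"message_id": 1003, "text": "third"},
    {"message_id": 1001, "text": "first"},
    {"message_id": 1002, "text": "second"},
]
print([m["text"] for m in in_display_order(received)])
print(partition_for("conversation-42", 12) == partition_for("conversation-42", 12))
```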

Group Chat — The Fan-out Problem

Group chat is the biggest scaling challenge. When 1 user sends a message to a group of 10,000 members, the system must fan-out the message to everyone — this is called write amplification.

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Fan-out on Write | Copy message to each member's inbox on send | Fast reads, simple client logic | High write amplification for large groups |
| Fan-out on Read | Store once, members query when opening chat | Efficient writes, saves storage | Slow reads for large groups, more complex |
| Hybrid | Write fan-out for small groups (<500), Read for large groups | Best balance | More complex logic |

graph LR
    subgraph Write["Fan-out on Write"]
        A1[User sends msg] --> B1[Message Service]
        B1 --> C1[Copy → User 1 Inbox]
        B1 --> D1[Copy → User 2 Inbox]
        B1 --> E1[Copy → User N Inbox]
    end

    subgraph Read["Fan-out on Read"]
        A2[User sends msg] --> B2[Message Service]
        B2 --> C2[Store 1 copy<br/>in Group Store]
        D2[User 1 opens chat] --> C2
        E2[User 2 opens chat] --> C2
    end

    style B1 fill:#e94560,stroke:#fff,color:#fff
    style B2 fill:#e94560,stroke:#fff,color:#fff
    style C2 fill:#2c3e50,stroke:#fff,color:#fff

Comparing two fan-out strategies for group chat

WhatsApp uses fan-out on write since groups max out at 1,024 members — write amplification is acceptable. Telegram uses a hybrid approach: fan-out on write for small groups, fan-out on read for supergroups with up to 200,000 members.
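A hybrid strategy comes down to one branch on group size. The sketch below uses the 500-member threshold from the table; the GroupMessaging class and its in-memory dictionaries are hypothetical stand-ins for the per-user inbox store and the shared group store, and it is not a description of how WhatsApp or Telegram implement this.

```python
from collections import defaultdict

SMALL_GROUP_LIMIT = 500  # threshold from the comparison table


class GroupMessaging:
    def __init__(self):
        self.inboxes = defaultdict(list)          # user_id -> messages (fan-out on write)
        self.group_timelines = defaultdict(list)  # group_id -> messages (fan-out on read)

    def send(self, group_id: str, member_ids: list, message: str) -> str:
        if len(member_ids) < SMALL_GROUP_LIMIT:
            # Small group: one write per member, reads stay a single inbox lookup.
            for member_id in member_ids:
                self.inboxes[member_id].append(message)
            return "fan-out-on-write"
        # Large group: a single write, members pull the timeline when they open the chat.
        self.group_timelines[group_id].append(message)
        return "fan-out-on-read"

    def read(self, user_id: str, large_group_ids: list) -> list:
        """Inbox plus the timelines of the large groups this user belongs to."""
        messages = list(self.inboxes[user_id])
        for group_id in large_group_ids:
            messages.extend(self.group_timelines[group_id])
        return messages


service = GroupMessaging()
print(service.send("family", ["u-1", "u-2", "u-3"], "dinner at 7"))
print(service.send("supergroup", [f"u-{i}" for i in range(10_000)], "release notes"))
print(service.read("u-1", ["supergroup"]))
```

The read path shows where the cost moves: large groups save thousands of writes per message but add one lookup per group whenever a member opens the app.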

Database Design — Storing Billions of Messages

Choosing the Right Database

Chat messages have distinct characteristics: write-heavy (continuous writes), recent-read-heavy (mostly reading recent messages), append-only (no edits, rare deletes), time-series-like (queried by time). These characteristics favor wide-column stores over traditional RDBMS.

| Database | Write Throughput | Read Pattern | Use Case |
|---|---|---|---|
| Cassandra / ScyllaDB | Very high (100K+ writes/s) | Partition key scan | Primary message storage |
| PostgreSQL | Medium | Flexible (index, join) | User profiles, settings |
| Redis | Extremely high (in-memory) | Key-value, sorted set | Session, presence, recent messages cache |
| S3 / R2 | High | Object key | Media files (images, videos, files) |

Cassandra Schema Design

-- Message table: partition by conversation, cluster by time
CREATE TABLE messages (
    conversation_id UUID,
    message_id BIGINT,       -- Snowflake ID (time-sortable)
    sender_id UUID,
    message_type TEXT,        -- 'text', 'image', 'video', 'file'
    content TEXT,
    media_url TEXT,
    status TEXT,              -- 'sent', 'delivered', 'read'
    created_at TIMESTAMP,
    PRIMARY KEY (conversation_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

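-- User conversation inbox: one row per conversation, newest first.
-- last_message_at is a clustering column, so moving a conversation to the
-- top means deleting the old row and inserting a new one.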
CREATE TABLE user_conversations (
    user_id UUID,
    last_message_at TIMESTAMP,
    conversation_id UUID,
    conversation_type TEXT,   -- '1:1' or 'group'
    last_message_preview TEXT,
    unread_count INT,
    PRIMARY KEY (user_id, last_message_at, conversation_id)
) WITH CLUSTERING ORDER BY (last_message_at DESC);

Why Snowflake ID Instead of UUID?

Snowflake ID (64-bit) consists of: 1 unused sign bit + 41 bits timestamp + 10 bits machine ID + 12 bits sequence number. Advantages: time-sortable (no additional timestamp column needed for sorting), smaller than UUID (8 bytes vs 16 bytes), and extremely fast generation without database roundtrips. Discord, Twitter, and Instagram all use variants of Snowflake ID.
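The bit layout translates almost directly into code. The generator below is a sketch of the general Snowflake scheme, not any particular company's implementation; the custom epoch is an arbitrary value chosen for illustration, and the class and function names are hypothetical.

```python
import threading
import time

CUSTOM_EPOCH_MS = 1_700_000_000_000  # arbitrary custom epoch (assumption for this sketch)
MACHINE_BITS = 10
SEQUENCE_BITS = 12
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1  # 4095 IDs per millisecond per machine


class SnowflakeGenerator:
    def __init__(self, machine_id: int):
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 10 bits")
        self._machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence exhausted for this millisecond: spin until the next one.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            timestamp = now - CUSTOM_EPOCH_MS
            return ((timestamp << (MACHINE_BITS + SEQUENCE_BITS))
                    | (self._machine_id << SEQUENCE_BITS)
                    | self._sequence)


def timestamp_ms_of(snowflake_id: int) -> int:
    """Recover the creation time (Unix ms) from the top bits of the ID."""
    return (snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)) + CUSTOM_EPOCH_MS


generator = SnowflakeGenerator(machine_id=7)
first, second = generator.next_id(), generator.next_id()
print(first, second, first < second, timestamp_ms_of(first))
```

IDs from one generator are strictly increasing. Across machines they are only roughly time-ordered, since ordering within the same millisecond falls back to machine ID and sequence.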

Hot/Cold Storage Tiering

Not all messages are accessed frequently. Tiering strategy:

  • Hot (0-7 days): Redis sorted set — ultra-fast queries, store last 50 messages per conversation
  • Warm (7-90 days): Cassandra/ScyllaDB — primary storage, SSD-backed
  • Cold (90+ days): Object storage (S3) as compressed Parquet — minimal cost, queryable via Spark/Presto when needed

Presence System — "Online", "Last seen", "Typing..."

The presence system tracks real-time user status: online, offline, last seen, typing. This seems simple but is very hard to scale because each status change must be broadcast to all of that user's contacts.

stateDiagram-v2
    [*] --> Online: WebSocket connected
    Online --> Away: Inactive > 5 minutes
    Away --> Online: User interaction
    Online --> Offline: Disconnect / Heartbeat timeout
    Away --> Offline: Heartbeat timeout
    Offline --> Online: Reconnect

    Online --> Typing: Start typing
    Typing --> Online: Stop typing > 3s
    Typing --> Online: Send message

Presence System state machine
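The state machine can be expressed as a plain transition table. This is a sketch only: the state names follow the diagram, while the event names are hypothetical labels for the diagram's edges.

```python
# (current_state, event) -> next_state, mirroring the edges of the diagram.
TRANSITIONS = {
    ("offline", "connect"): "online",
    ("online", "inactive_5m"): "away",
    ("away", "interaction"): "online",
    ("online", "disconnect"): "offline",
    ("online", "heartbeat_timeout"): "offline",
    ("away", "heartbeat_timeout"): "offline",
    ("online", "start_typing"): "typing",
    ("typing", "stop_typing_3s"): "online",
    ("typing", "send_message"): "online",
}


def next_state(state: str, event: str) -> str:
    """Return the new state; events with no matching edge leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = "offline"
for event in ["connect", "start_typing", "send_message", "inactive_5m", "heartbeat_timeout"]:
    state = next_state(state, event)
    print(f"{event:>18} -> {state}")
```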

Scaling Presence

Naive approach: broadcast to all friends whenever a user changes status. If a user has 500 online friends → 500 WebSocket pushes. 1 million users changing status/minute → 500 million pushes/minute. Not feasible.

Smart approach — Lazy Presence:

  1. Store status in Redis with TTL (key: presence:{userId}, value: online, TTL: 60s)
  2. Heartbeat every 30s refreshes TTL — when TTL expires, automatically transitions to offline
  3. Only query presence when a user opens a conversation, don't broadcast to entire contact list
  4. Subscribe to presence changes of visible contacts via Redis Pub/Sub channels
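The TTL mechanism in steps 1 to 3 can be sketched without Redis. The PresenceStore class below is a hypothetical in-memory stand-in: a dict of expiry times plays the role of Redis keys with a TTL, and the injectable clock exists only to make the behaviour easy to demonstrate.

```python
import time


class PresenceStore:
    """In-memory stand-in for Redis keys like presence:{userId} with a TTL."""

    def __init__(self, ttl_seconds: float = 60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the key expires

    def heartbeat(self, user_id: str) -> None:
        # Every heartbeat refreshes the TTL, like re-setting the key with an expiry.
        self._expires_at[user_id] = self._clock() + self._ttl

    def status(self, user_id: str) -> str:
        # Pull-based: called only when someone opens a conversation with this user.
        expires_at = self._expires_at.get(user_id)
        if expires_at is None or expires_at <= self._clock():
            self._expires_at.pop(user_id, None)  # expired keys simply disappear
            return "offline"
        return "online"


fake_now = [0.0]
store = PresenceStore(clock=lambda: fake_now[0])
store.heartbeat("user_B")   # t = 0s
fake_now[0] = 45.0
print(store.status("user_B"))  # still inside the 60s TTL
fake_now[0] = 61.0
print(store.status("user_B"))  # no heartbeat since t = 0s, key has expired
```

No explicit "went offline" event is needed: a client that crashes or loses its network just stops refreshing the key.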

WhatsApp Presence: "Last seen"

WhatsApp doesn't continuously broadcast online status. Instead, when user A opens a chat with B, the app queries B's "last seen". This is a pull-based design that saves significant bandwidth compared to push-based approaches. The "Typing..." indicator is only sent to users currently viewing that conversation.

Push Notifications for Offline Users

When the receiver is offline, the message is persisted to the database and triggers a push notification via FCM (Android) or APNs (iOS). Key design considerations:

  • Batching: Combine multiple messages into a single notification if they arrive within a short window ("You have 5 new messages from Anh Tu")
  • Priority: 1:1 messages → high priority (display immediately), group chat → normal priority (may be delayed by OS)
  • Unread sync: When the user reopens the app, Sync Service pulls all unread messages from the database, not relying on push notifications
  • Deduplication: Client uses messageId to dedup — if already received via WebSocket, ignore the notification
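Client-side deduplication, which at-least-once delivery depends on, is a small amount of code. The ClientMessageLog class below is a hypothetical sketch: it remembers every messageId it has accepted and ignores repeats, regardless of which channel they arrive on.

```python
class ClientMessageLog:
    """Keeps each message once, even if it arrives over several channels."""

    def __init__(self):
        self._seen_ids = set()
        self.messages = []  # (message_id, text, source) in arrival order

    def receive(self, message_id: int, text: str, source: str) -> bool:
        """Return True if the message was new, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        self.messages.append((message_id, text, source))
        return True


log = ClientMessageLog()
print(log.receive(1001, "Hello!", "websocket"))  # first copy is accepted
print(log.receive(1001, "Hello!", "push"))       # push for the same message is ignored
print(log.receive(1001, "Hello!", "sync"))       # so is a retry or a sync replay
print(len(log.messages))
```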

End-to-End Encryption (E2EE)

E2EE ensures only the sender and receiver can read message content — the server only sees ciphertext. WhatsApp and Signal use the Signal Protocol (Double Ratchet Algorithm + X3DH key agreement).

sequenceDiagram
    participant A as Alice
    participant S as Server
    participant B as Bob

    Note over A,B: Key Exchange (X3DH)
    A->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys
    B->>S: Register Identity Key + Signed Pre-Key + One-time Pre-Keys

    A->>S: Request Bob's Pre-Key Bundle
    S-->>A: Bob's keys (Identity + Signed Pre + One-time)

    Note over A: Compute Shared Secret via<br/>X3DH key agreement
    A->>S: Encrypted message (ciphertext)
    S->>B: Forward ciphertext
    Note over B: Compute Shared Secret<br/>→ Decrypt message

    Note over A,B: Double Ratchet: each message<br/>uses a different key

End-to-End Encryption flow with Signal Protocol

Why Double Ratchet?

Forward secrecy: If the current key is compromised, attackers cannot decrypt past messages, because each message key is derived from the ratchet chain through a one-way function and earlier keys are discarded after use. Post-compromise security: After a key is compromised, the system self-heals, because the DH ratchet mixes fresh key material into the chain the next time the two parties exchange messages.
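The "one key per message" idea can be illustrated with the symmetric half of the ratchet: a KDF chain in which each step produces a message key and the next chain key. This is a teaching sketch only, not the Signal Protocol. It omits X3DH, the DH ratchet, and the encryption itself, the starting secret is a placeholder, and the function names are hypothetical. The HMAC-SHA256 construction with distinct one-byte inputs is one common way to build such a chain.

```python
import hashlib
import hmac


def ratchet_step(chain_key: bytes) -> tuple:
    """One step of a KDF chain: returns (next_chain_key, message_key)."""
    message_key = hmac.new(chain_key, b"\x01", hashlib.sha256).digest()
    next_chain_key = hmac.new(chain_key, b"\x02", hashlib.sha256).digest()
    return next_chain_key, message_key


def derive_message_keys(initial_chain_key: bytes, n: int) -> list:
    keys, chain_key = [], initial_chain_key
    for _ in range(n):
        # The old chain key is overwritten, so earlier message keys cannot be re-derived.
        chain_key, message_key = ratchet_step(chain_key)
        keys.append(message_key)
    return keys


# Placeholder for the secret both sides would obtain from the key agreement.
shared_secret = hashlib.sha256(b"stand-in for the X3DH shared secret").digest()

alice_keys = derive_message_keys(shared_secret, 3)
bob_keys = derive_message_keys(shared_secret, 3)
print(alice_keys == bob_keys)         # both sides derive the same per-message keys
print(len(set(alice_keys)) == 3)      # and every message gets a different key
```

Because HMAC is one-way, someone who steals the chain key after message 3 cannot walk backwards to the keys for messages 1 and 2.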

Scaling Strategies

WebSocket Gateway Scaling

The WebSocket gateway is stateful (holds connections), so it can't be scaled as simply as stateless HTTP services. Strategies:

  • Horizontal scaling: Add servers, use consistent hashing to distribute users. When adding/removing servers, only ~1/N users need to reconnect
  • Connection draining: When shutting down a server, notify clients to reconnect gradually (graceful) instead of abrupt disconnection
  • Regional deployment: Deploy WebSocket gateways across multiple regions, route users to the nearest server using GeoDNS

Message Database Sharding

graph LR
    subgraph Shard["Sharding by conversation_id"]
        M[Message] --> H{Hash<br/>conversation_id}
        H -->|hash % 4 = 0| S0[Shard 0]
        H -->|hash % 4 = 1| S1[Shard 1]
        H -->|hash % 4 = 2| S2[Shard 2]
        H -->|hash % 4 = 3| S3[Shard 3]
    end

    style H fill:#e94560,stroke:#fff,color:#fff
    style S0 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S1 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

Message database sharding by conversation_id

The shard key is conversation_id rather than user_id because: all messages in the same conversation reside on 1 shard → querying "get the latest 50 messages of conversation X" only hits 1 shard, no scatter-gather needed.

| Component | Technology | Rationale |
|---|---|---|
| WebSocket Gateway | .NET 10 (Kestrel) / Go | Kestrel handles hundreds of thousands of concurrent connections with low memory footprint |
| Message Queue | Apache Kafka | High throughput, partition-based ordering, replay capability |
| Message Store | ScyllaDB (Cassandra-compatible) | Extremely high write throughput, linear scalability, familiar CQL |
| Session/Presence Cache | Redis Cluster | Sub-millisecond latency, Pub/Sub for presence, automatic TTL cleanup |
| User/Group Metadata | PostgreSQL | ACID compliance for user profiles, group settings, permissions |
| Media Storage | S3 / Cloudflare R2 | S3-compatible, R2 has zero egress fees, which suits chat images/videos |
| Search | Elasticsearch / Meilisearch | Full-text search across message history |
| Push Notifications | FCM + APNs | Platform-native, reliable, free |

Monitoring & Observability

For real-time systems, monitoring is mission-critical. Key metrics to track:

  • Message delivery latency (P50, P95, P99): Time from sender sending to receiver receiving — target P99 < 200ms
  • WebSocket connection count per server: Alert when exceeding 80% capacity
  • Message queue lag: Consumer lag on Kafka — continuously increasing lag signals a bottleneck
  • Failed delivery rate: Percentage of messages that fail to deliver after 3 retries
  • Database write latency: P99 write latency of Cassandra — target < 10ms
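As a concrete illustration of the latency metric, the sketch below computes P50, P95, and P99 from raw samples using the nearest-rank method and checks the 200ms target. The function names are hypothetical; in production these percentiles would normally come from the metrics system's histograms rather than from raw samples.

```python
import math

P99_TARGET_MS = 200  # target from the list above


def percentile(samples: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms: list) -> dict:
    report = {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
    report["p99_within_target"] = report["p99"] < P99_TARGET_MS
    return report


# Mostly fast deliveries with a slow tail.
samples_ms = [40] * 90 + [120] * 8 + [450] * 2
print(latency_report(samples_ms))
```

The example shows why the tail matters: the median looks healthy while the P99 misses the target.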

Distributed Tracing for Message Flow

Attach a traceId to each message from the moment the sender sends it, propagating through WebSocket Gateway → Kafka → Consumer → Database → Receiver. Use OpenTelemetry + Jaeger/Tempo to trace the entire message journey — when a user reports "slow message delivery", you can pinpoint exactly where the bottleneck is.

Real-world Lessons from WhatsApp and Telegram

WhatsApp — Erlang/FreeBSD
WhatsApp is famous for its tiny team: in 2015, roughly 50 engineers served about 900 million users. The secret is the Erlang VM (BEAM), designed for telecom workloads: it handles millions of lightweight processes, supports hot code reloading, and is fault-tolerant. Each WhatsApp server handles 2-3 million connections.
Telegram — MTProto + Bare-metal
Telegram developed the MTProto protocol instead of using standard TLS, optimized for mobile (fewer round-trips, smaller packets). Infrastructure runs on bare-metal servers, not cloud — reducing latency and costs at large scale.
Discord — Elixir + Rust
Discord started with Elixir (BEAM VM, like Erlang) for the WebSocket gateway, then rewrote hot paths in Rust to optimize memory. A single Discord server handles up to 1 million concurrent connections.
Slack — Java + WebSocket
Slack's main backend was written in PHP and later migrated to Hack, with Java services powering real-time messaging over WebSocket and sharded MySQL for persistence. As it scaled, Slack ran into problems with its own MySQL sharding and gradually migrated to Vitess, a sharding and clustering layer for MySQL.

Conclusion

Designing a large-scale real-time Chat system requires careful consideration at multiple layers: from WebSocket connection management with consistent hashing and heartbeat, message routing with appropriate fan-out strategies, database design optimized for write-heavy workloads, to bandwidth-efficient presence systems and privacy-preserving E2E encryption.

There's no "one-size-fits-all" solution — WhatsApp chose Erlang and fan-out on write because of small groups, Telegram chose MTProto and hybrid fan-out for large supergroups, Discord rewrote in Rust due to memory pressure. The common thread: understanding the tradeoffs in every architectural decision, designing for failure (network partitions, server crashes), and always measuring with real metrics rather than guessing.
