Design a Realtime Chat with SignalR in .NET
How to build a realtime chat service in .NET with SignalR: WebSocket fan-out, Redis backplane for scale-out, presence tracking, and message ordering across rooms.
Table of contents
- When does the chat case study come up?
- What back-of-envelope numbers shape the design?
- What does the architecture look like?
- What is the .NET 10 wiring with SignalR + Redis backplane?
- How is message ordering guaranteed?
- What scale-out path does this support?
- What failure modes does this introduce?
- When is SignalR the wrong shape?
- Where should you go from here?
A chat system is the case study where every block in the series shows up: WebSockets for realtime, fan-out across replicas, durable storage for history, presence tracking, and message ordering. This chapter designs one in .NET with SignalR, then wires the production shape that scales to 100K concurrent users.
When does the chat case study come up?
Three contexts. Customer support chat inside SaaS products. Team messaging (Slack, Teams clones). Game lobbies and realtime collaboration. The architectural ideas are the same; the volume and durability requirements differ.
The interview version usually frames it as "design WhatsApp at scale". The production version is usually "add chat to our existing app without melting the database".
What back-of-envelope numbers shape the design?
| Quantity | Estimate |
| --- | --- |
| Concurrent users | 100K |
| Avg messages/user/hour | 20 |
| Peak msgs/sec | 100K × 20 / 3600 × 5 (peak factor) ≈ 2,800 |
| Avg message size | 500 bytes (text + metadata) |
| Storage/day | 100K users × 100 msgs/user/day (20/h over ~5 active hours) × 500 B = 5 GB/day |
| Connections | 100K WebSockets |
| Memory/connection | ~30 KB (SignalR + Kestrel) ≈ 3 GB total |
100K WebSockets fit comfortably on 4 ASP.NET Core instances behind a sticky-session load balancer. Postgres handles 2,800 writes/sec without tuning. The interesting work is the fan-out across replicas, which is what the Redis backplane does.
What does the architecture look like?
```mermaid
flowchart LR
    Client1[Browser/Mobile] -.WebSocket.-> LB[Sticky LB]
    Client2[Browser/Mobile] -.WebSocket.-> LB
    LB --> H1[SignalR Hub 1]
    LB --> H2[SignalR Hub 2]
    LB --> H3[SignalR Hub 3]
    H1 -.pub/sub.-> Redis[(Redis Backplane)]
    H2 -.pub/sub.-> Redis
    H3 -.pub/sub.-> Redis
    H1 --> PG[(Postgres<br/>messages)]
    H1 --> Pres[(Redis<br/>presence)]
    H1 -. publish event .-> Q[(Queue<br/>notifications)]
```
Sticky load balancing keeps a connection on one hub instance. Messages sent to a room are persisted to Postgres, broadcast via the Redis backplane to all hubs, and forwarded to local connections on each hub. Presence keys live in Redis with TTL. Notification events go through a queue to email/push.
What is the .NET 10 wiring with SignalR + Redis backplane?
```csharp
// Program.cs
builder.Services.AddSignalR()
    .AddStackExchangeRedis(builder.Configuration.GetConnectionString("Redis")!,
        opt => opt.Configuration.ChannelPrefix = RedisChannel.Literal("chat"));

app.MapHub<ChatHub>("/hubs/chat");

// Hub (ChatHub.cs). GetUserId() is an app-specific claims extension.
using System.Text.Json;
using Microsoft.AspNetCore.SignalR;
using StackExchange.Redis;

public class ChatHub(AppDbContext db, IConnectionMultiplexer redis) : Hub
{
    public override async Task OnConnectedAsync()
    {
        var userId = Context.User!.GetUserId();

        // Presence: a TTL key that the client refreshes via Heartbeat.
        await redis.GetDatabase().StringSetAsync($"presence:user:{userId}",
            Context.ConnectionId, TimeSpan.FromSeconds(30));

        await base.OnConnectedAsync();
    }

    public async Task JoinRoom(Guid roomId)
        => await Groups.AddToGroupAsync(Context.ConnectionId, $"room:{roomId}");

    public async Task SendMessage(Guid roomId, string text)
    {
        var msg = new Message
        {
            Id = Guid.NewGuid(),
            RoomId = roomId,
            UserId = Context.User!.GetUserId(),
            Text = text,
            CreatedAt = DateTimeOffset.UtcNow,
            Sequence = await NextSequenceAsync(roomId) // monotonic per room; sketch below
        };

        db.Messages.Add(msg);
        await db.SaveChangesAsync();

        // Cache the recent message for replay after reconnects
        await redis.GetDatabase().SortedSetAddAsync(
            $"room:{roomId}:recent", JsonSerializer.Serialize(msg), msg.Sequence);

        // Broadcast - the backplane fans out across replicas
        await Clients.Group($"room:{roomId}").SendAsync("message", msg);
    }

    public async Task Heartbeat()
    {
        var userId = Context.User!.GetUserId();
        await redis.GetDatabase().KeyExpireAsync(
            $"presence:user:{userId}", TimeSpan.FromSeconds(30));
    }
}
```
Three details. The Redis backplane is one extension method - no custom fan-out code. The per-room Sequence gives monotonic ordering even when two messages arrive at different hubs within the same millisecond. Presence is just a TTL key; if the user disconnects, the key expires.
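For completeness, the client side. A minimal sketch using the official .NET client package (Microsoft.AspNetCore.SignalR.Client); the URL, roomId, and Render callback are placeholders, not part of the server code above:

```csharp
// Sketch: minimal .NET client for the hub above.
// Requires the Microsoft.AspNetCore.SignalR.Client package.
// The URL is illustrative; roomId and Render are app code.
var connection = new HubConnectionBuilder()
    .WithUrl("https://chat.example.com/hubs/chat")
    .WithAutomaticReconnect()   // default backoff: 0 s, 2 s, 10 s, 30 s
    .Build();

// Server pushes arrive via the "message" event used in SendMessage.
connection.On<Message>("message", msg => Render(msg));

await connection.StartAsync();
await connection.InvokeAsync("JoinRoom", roomId);
await connection.InvokeAsync("SendMessage", roomId, "hello");
```

WithAutomaticReconnect is the piece the failure-modes section below leans on: the client rides out a hub loss without app code managing the retry loop.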
How is message ordering guaranteed?
```mermaid
sequenceDiagram
    participant A as User A (Hub1)
    participant H1 as Hub 1
    participant DB as Postgres
    participant H2 as Hub 2
    participant B as User B (Hub2)
    A->>H1: SendMessage(room, "hello")
    H1->>DB: INSERT Sequence=42
    H1->>H1: Broadcast room
    H1-->>H2: Backplane pub/sub Sequence=42
    H2->>B: deliver Sequence=42
    Note over A,B: All clients see Sequence 42 ordered globally per room.
```
The sequence comes from a per-room counter (Postgres sequence or Redis INCR). Clients order their local view by sequence number, so late-arriving messages reorder correctly. Without this, two messages sent within milliseconds at different hubs can appear out of order on different clients.
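The hub code calls NextSequenceAsync without showing it. A minimal sketch of the Redis INCR variant - the key name room:{roomId}:seq is an assumption, and the Postgres-sequence variant would swap this for a nextval query:

```csharp
// Sketch: per-room monotonic counter via Redis INCR.
// INCR is atomic, so two hubs asking for a sequence number at the
// same instant can never receive the same value. The key name
// "room:{roomId}:seq" is illustrative.
private async Task<long> NextSequenceAsync(Guid roomId)
    => await redis.GetDatabase().StringIncrementAsync($"room:{roomId}:seq");
```

The trade-off: the counter lives outside the transaction that writes the row, so a crash between INCR and SaveChangesAsync leaves a gap in the sequence - harmless for ordering, since clients only sort by it.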
What scale-out path does this support?
- Hub instances: scale horizontally with the backplane; sticky sessions via the LB.
- Backplane: Redis Cluster shards channels by room ID hash; one channel per room reduces broadcast cost.
- Storage: partition messages table by room ID + month; reads hit one partition.
- Presence: Redis Cluster; keys spread across shards by user ID hash.
For >1M concurrent connections, replace SignalR with a pure WebSocket gateway and split the application logic into separate microservices. Up to that point, SignalR scales fine.
What failure modes does this introduce?
- Sticky session fail-over - the LB drops a hub; clients reconnect to a new hub. Mitigation: SignalR's automatic reconnect, plus replaying the last N messages from the recent-messages cache (see the sketch after this list).
- Backplane outage - Redis pub/sub dies; hubs cannot fan out across instances. Mitigation: the resilience handlers from chapter 11 apply; degrade to single-hub mode if the backplane is unhealthy.
- Message storm in one room - 100K users in one room, each sending a message a second. Mitigation: rate-limit per user per room; backpressure when the broadcast queue gets long.
- Presence drift - user closes laptop, presence lingers until TTL. Mitigation: 15-30 s TTL keeps drift small; clients heartbeat every 10 s.
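The replay mitigation from the first bullet, sketched. GetMessagesSince is a hypothetical hub method (not part of the hub above) that reads the recent-messages sorted set written in SendMessage, using the sequence number as the score:

```csharp
// Sketch: server-side replay for reconnecting clients. Sequence
// numbers above the client's last-seen value are exactly the missed
// messages, as long as they are still inside the cache window.
public async Task<List<Message>> GetMessagesSince(Guid roomId, long lastSeen)
{
    var entries = await redis.GetDatabase().SortedSetRangeByScoreAsync(
        $"room:{roomId}:recent", start: lastSeen + 1);
    return entries
        .Select(e => JsonSerializer.Deserialize<Message>(e.ToString())!)
        .ToList();
}
```

On the client, the Reconnected event is the natural call site. Group membership does not survive a reconnect to a different hub, so rejoin first:

```csharp
// Sketch: client-side backfill after SignalR's automatic reconnect.
// roomId, lastSeenSequence, and Render are app code.
connection.Reconnected += async _ =>
{
    await connection.InvokeAsync("JoinRoom", roomId);
    var missed = await connection.InvokeAsync<List<Message>>(
        "GetMessagesSince", roomId, lastSeenSequence);
    foreach (var msg in missed) Render(msg);
};
```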
When is SignalR the wrong shape?
When you need very low end-to-end latency (under 50 ms), very high concurrent connections (>1M), or client SDKs in languages SignalR doesn't cover (official clients exist for .NET, JavaScript, and Java; Python and Swift rely on community ports). In those cases, raw WebSockets behind Nginx plus a Redis pub/sub fabric work for any language and give more control - at the cost of writing the connection lifecycle handling yourself.
Where should you go from here?
Next case study: a notification system. The fan-out from chat extends to multi-channel delivery (email, SMS, push) in the next chapter, and many of the same patterns (queue, idempotency, preference store) carry over directly.
Frequently asked questions
Why SignalR over raw WebSockets?
SignalR handles transport negotiation and fallback, automatic reconnect, groups, and scale-out via a one-line backplane. Raw WebSockets give more control, but you rebuild all of that lifecycle handling yourself.
How does the Redis backplane work?
Every hub instance subscribes to Redis pub/sub channels. A broadcast to a group is published once; the other hubs receive it and forward it to their local connections. It is a single extension method in Program.cs, with no custom fan-out code.
Where do messages live?
Postgres is the source of truth, partitioned by room ID + month; a Redis sorted set per room holds recent messages for cheap replay after reconnects.
How is presence tracked?
presence:user:{id} with a 30 s TTL that heartbeats refresh. To list online friends, MGET the expected keys; missing = offline. Cheap, no fan-out cost, and self-correcting on disconnect (the TTL expires).
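The friends lookup from that answer as code - GetOnlineAsync and the caller-supplied friend list are illustrative, not an API from the chapter:

```csharp
// Sketch: batch presence check. StringGetAsync with a key array is a
// single MGET round-trip; a value-less RedisValue means the TTL
// expired, i.e. the user is offline.
public async Task<HashSet<Guid>> GetOnlineAsync(IReadOnlyList<Guid> friendIds)
{
    var keys = friendIds.Select(id => (RedisKey)$"presence:user:{id}").ToArray();
    var values = await redis.GetDatabase().StringGetAsync(keys);
    return friendIds.Where((_, i) => values[i].HasValue).ToHashSet();
}
```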