Design a Realtime Chat with SignalR in .NET
How to build a realtime chat service in .NET with SignalR: WebSocket fan-out, Redis backplane for scale-out, presence tracking, and message ordering across rooms.
Table of contents
- When does the chat case study come up?
- What back-of-envelope numbers shape the design?
- What does the architecture look like?
- What is the .NET 10 wiring with SignalR + Redis backplane?
- How is message ordering guaranteed?
- What scale-out path does this support?
- What failure modes does this introduce?
- When is SignalR the wrong shape?
- Where should you go from here?
A chat system is the case study where every block in the series shows up: WebSockets for realtime, fan-out across replicas, durable storage for history, presence tracking, and message ordering. This chapter designs one in .NET with SignalR, then wires the production shape that scales to 100K concurrent users.
When does the chat case study come up?
Three contexts. Customer support chat inside SaaS products. Team messaging (Slack, Teams clones). Game lobbies and realtime collaboration. The architectural ideas are the same; the volume and durability requirements differ.
The interview version usually frames it as "design WhatsApp at scale". The production version is usually "add chat to our existing app without melting the database".
What back-of-envelope numbers shape the design?
| Quantity | Estimate |
| --- | --- |
| Concurrent users | 100K |
| Avg messages/user/hour | 20 |
| Peak msgs/sec | 100K × 20 / 3600 × 5 (peak factor) ≈ 2,800 |
| Avg message size | 500 bytes (text + metadata) |
| Storage/day | 100K users × 100 msgs/user/day (20/h over ~5 active hours) × 500 B = 5 GB/day |
| Connections | 100K WebSockets |
| Memory/connection | ~30 KB (SignalR + Kestrel) ≈ 3 GB total |
100K WebSockets fit comfortably on 4 ASP.NET Core instances behind a sticky-session load balancer. Postgres handles 2,800 writes/sec without tuning. The interesting work is the fan-out across replicas, which is what the Redis backplane does.
What does the architecture look like?
```mermaid
flowchart LR
    Client1[Browser/Mobile] -.WebSocket.-> LB[Sticky LB]
    Client2[Browser/Mobile] -.WebSocket.-> LB
    LB --> H1[SignalR Hub 1]
    LB --> H2[SignalR Hub 2]
    LB --> H3[SignalR Hub 3]
    H1 -.pub/sub.-> Redis[(Redis Backplane)]
    H2 -.pub/sub.-> Redis
    H3 -.pub/sub.-> Redis
    H1 --> PG[(Postgres<br/>messages)]
    H1 --> Pres[(Redis<br/>presence)]
    H1 -. publish event .-> Q[(Queue<br/>notifications)]
```
Sticky load balancing keeps a connection on one hub instance. Messages sent to a room are persisted to Postgres, broadcast via the Redis backplane to all hubs, and forwarded to local connections on each hub. Presence keys live in Redis with TTL. Notification events go through a queue to email/push.
What is the .NET 10 wiring with SignalR + Redis backplane?
```csharp
// Program.cs
builder.Services.AddSignalR()
    .AddStackExchangeRedis(builder.Configuration.GetConnectionString("Redis")!,
        opt => opt.Configuration.ChannelPrefix = RedisChannel.Literal("chat"));

app.MapHub<ChatHub>("/hubs/chat");

// Hub (ChatHub.cs). GetUserId() is an app-specific claims extension.
using System.Text.Json;
using Microsoft.AspNetCore.SignalR;
using StackExchange.Redis;

public class ChatHub(AppDbContext db, IConnectionMultiplexer redis) : Hub
{
    public override async Task OnConnectedAsync()
    {
        var userId = Context.User!.GetUserId();

        // Presence: a TTL key that the client refreshes via Heartbeat.
        await redis.GetDatabase().StringSetAsync($"presence:user:{userId}",
            Context.ConnectionId, TimeSpan.FromSeconds(30));

        await base.OnConnectedAsync();
    }

    public async Task JoinRoom(Guid roomId)
        => await Groups.AddToGroupAsync(Context.ConnectionId, $"room:{roomId}");

    public async Task SendMessage(Guid roomId, string text)
    {
        var msg = new Message
        {
            Id = Guid.NewGuid(),
            RoomId = roomId,
            UserId = Context.User!.GetUserId(),
            Text = text,
            CreatedAt = DateTimeOffset.UtcNow,
            Sequence = await NextSequenceAsync(roomId) // monotonic per room; sketch below
        };

        db.Messages.Add(msg);
        await db.SaveChangesAsync();

        // Cache the recent message for replay after reconnects
        await redis.GetDatabase().SortedSetAddAsync(
            $"room:{roomId}:recent", JsonSerializer.Serialize(msg), msg.Sequence);

        // Broadcast - the backplane fans out across replicas
        await Clients.Group($"room:{roomId}").SendAsync("message", msg);
    }

    public async Task Heartbeat()
    {
        var userId = Context.User!.GetUserId();
        await redis.GetDatabase().KeyExpireAsync(
            $"presence:user:{userId}", TimeSpan.FromSeconds(30));
    }
}
```
Three details. The Redis backplane is one extension method - no custom fan-out code. The per-room Sequence gives monotonic ordering even when two messages arrive at different hubs within the same millisecond. Presence is just a TTL key; if the user disconnects, the key expires.
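For completeness, the client side. A minimal sketch using the official .NET client package (Microsoft.AspNetCore.SignalR.Client); the URL, roomId, and Render callback are placeholders, not part of the server code above:

```csharp
// Sketch: minimal .NET client for the hub above.
// Requires the Microsoft.AspNetCore.SignalR.Client package.
// The URL is illustrative; roomId and Render are app code.
var connection = new HubConnectionBuilder()
    .WithUrl("https://chat.example.com/hubs/chat")
    .WithAutomaticReconnect()   // default backoff: 0 s, 2 s, 10 s, 30 s
    .Build();

// Server pushes arrive via the "message" event used in SendMessage.
connection.On<Message>("message", msg => Render(msg));

await connection.StartAsync();
await connection.InvokeAsync("JoinRoom", roomId);
await connection.InvokeAsync("SendMessage", roomId, "hello");
```

WithAutomaticReconnect is the piece the failure-modes section below leans on: the client rides out a hub loss without app code managing the retry loop.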
How is message ordering guaranteed?
```mermaid
sequenceDiagram
    participant A as User A (Hub1)
    participant H1 as Hub 1
    participant DB as Postgres
    participant H2 as Hub 2
    participant B as User B (Hub2)
    A->>H1: SendMessage(room, "hello")
    H1->>DB: INSERT Sequence=42
    H1->>H1: Broadcast room
    H1-->>H2: Backplane pub/sub Sequence=42
    H2->>B: deliver Sequence=42
    Note over A,B: All clients see Sequence 42 ordered globally per room.
```
The sequence comes from a per-room counter (Postgres sequence or Redis INCR). Clients order their local view by sequence number, so late-arriving messages reorder correctly. Without this, two messages sent within milliseconds at different hubs can appear out of order on different clients.
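The hub code calls NextSequenceAsync without showing it. A minimal sketch of the Redis INCR variant - the key name room:{roomId}:seq is an assumption, and the Postgres-sequence variant would swap this for a nextval query:

```csharp
// Sketch: per-room monotonic counter via Redis INCR.
// INCR is atomic, so two hubs asking for a sequence number at the
// same instant can never receive the same value. The key name
// "room:{roomId}:seq" is illustrative.
private async Task<long> NextSequenceAsync(Guid roomId)
    => await redis.GetDatabase().StringIncrementAsync($"room:{roomId}:seq");
```

The trade-off: the counter lives outside the transaction that writes the row, so a crash between INCR and SaveChangesAsync leaves a gap in the sequence - harmless for ordering, since clients only sort by it.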
What scale-out path does this support?
- Hub instances: scale horizontally with the backplane; sticky sessions via the LB.
- Backplane: Redis Cluster shards channels by room ID hash; one channel per room reduces broadcast cost.
- Storage: partition messages table by room ID + month; reads hit one partition.
- Presence: Redis Cluster; keys spread across shards by user ID hash.
For >1M concurrent connections, replace SignalR with a pure WebSocket gateway and split the application logic into separate microservices. Up to that point, SignalR scales fine.
What failure modes does this introduce?
- Sticky session fail-over - the LB drops a hub; clients reconnect to a new hub. Mitigation: SignalR's automatic reconnect, plus replaying the last N messages from the recent-messages cache (see the sketch after this list).
- Backplane outage - Redis pub/sub dies; hubs cannot fan out across instances. Mitigation: the resilience handlers from chapter 11 apply; degrade to single-hub mode if the backplane is unhealthy.
- Message storm in one room - 100K users in one room, each sending a message a second. Mitigation: rate-limit per user per room; backpressure when the broadcast queue gets long.
- Presence drift - user closes laptop, presence lingers until TTL. Mitigation: 15-30 s TTL keeps drift small; clients heartbeat every 10 s.
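The replay mitigation from the first bullet, sketched. GetMessagesSince is a hypothetical hub method (not part of the hub above) that reads the recent-messages sorted set written in SendMessage, using the sequence number as the score:

```csharp
// Sketch: server-side replay for reconnecting clients. Sequence
// numbers above the client's last-seen value are exactly the missed
// messages, as long as they are still inside the cache window.
public async Task<List<Message>> GetMessagesSince(Guid roomId, long lastSeen)
{
    var entries = await redis.GetDatabase().SortedSetRangeByScoreAsync(
        $"room:{roomId}:recent", start: lastSeen + 1);
    return entries
        .Select(e => JsonSerializer.Deserialize<Message>(e.ToString())!)
        .ToList();
}
```

On the client, the Reconnected event is the natural call site. Group membership does not survive a reconnect to a different hub, so rejoin first:

```csharp
// Sketch: client-side backfill after SignalR's automatic reconnect.
// roomId, lastSeenSequence, and Render are app code.
connection.Reconnected += async _ =>
{
    await connection.InvokeAsync("JoinRoom", roomId);
    var missed = await connection.InvokeAsync<List<Message>>(
        "GetMessagesSince", roomId, lastSeenSequence);
    foreach (var msg in missed) Render(msg);
};
```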
When is SignalR the wrong shape?
When you need very low end-to-end latency (under 50 ms), very high concurrent connections (>1M), or client SDKs in languages SignalR doesn't cover (official clients exist for .NET, JavaScript, and Java; Python and Swift rely on community ports). In those cases, raw WebSockets behind Nginx plus a Redis pub/sub fabric work for any language and give more control - at the cost of writing the connection lifecycle handling yourself.
Where should you go from here?
Next case study: a notification system. The fan-out from chat extends to multi-channel delivery (email, SMS, push) in the next chapter, and many of the same patterns (queue, idempotency, preference store) carry over directly.
Frequently asked questions
Why SignalR over raw WebSockets?
SignalR handles transport negotiation and fallback, automatic reconnect, groups, and scale-out via a one-line backplane. Raw WebSockets give more control, but you rebuild all of that lifecycle handling yourself.
How does the Redis backplane work?
Every hub instance subscribes to Redis pub/sub channels. A broadcast to a group is published once; the other hubs receive it and forward it to their local connections. It is a single extension method in Program.cs, with no custom fan-out code.
Where do messages live?
Postgres is the source of truth, partitioned by room ID + month; a Redis sorted set per room holds recent messages for cheap replay after reconnects.
How is presence tracked?
presence:user:{id} with a 30 s TTL that heartbeats refresh. To list online friends, MGET the expected keys; missing = offline. Cheap, no fan-out cost, and self-correcting on disconnect (the TTL expires).
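The friends lookup from that answer as code - GetOnlineAsync and the caller-supplied friend list are illustrative, not an API from the chapter:

```csharp
// Sketch: batch presence check. StringGetAsync with a key array is a
// single MGET round-trip; a value-less RedisValue means the TTL
// expired, i.e. the user is offline.
public async Task<HashSet<Guid>> GetOnlineAsync(IReadOnlyList<Guid> friendIds)
{
    var keys = friendIds.Select(id => (RedisKey)$"presence:user:{id}").ToArray();
    var values = await redis.GetDatabase().StringGetAsync(keys);
    return friendIds.Where((_, i) => values[i].HasValue).ToHashSet();
}
```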