System Design .NET — Wrap-Up: 5 Lessons That Survive
Wrap-up of System Design .NET A -> Z: five lessons that outlive specific case studies, the production .NET architecture checklist, and what to read next.
Table of contents
- Lesson 1 — what does estimating buy that drawing boxes does not?
- Lesson 2 — what is the right default stack for a .NET service?
- Lesson 3 — when should the slow work go through a queue?
- Lesson 4 — how do I design for partial failure?
- Lesson 5 — what is observability really for?
- How do the five lessons compose into one architecture?
- What is the production .NET architecture checklist?
- What should you read after this series?
- What should you do tomorrow?
- What if my problem doesn't fit any case study?
- Where should you go from here?
You started the series wanting a repeatable way to answer "how would you design X". You finished it, hopefully, with the realisation that "how" matters less than "with what numbers". The final chapter is short by design - five lessons, a checklist, and a list of what to read next. The series stays; the conversation about your codebase begins.
Lesson 1 — what does estimating buy that drawing boxes does not?
Every architecture decision in this series traces back to a number from chapter 2. The cache exists because of read amplification; the queue exists because of long-running side effects; the shard exists because writes exceed one Postgres node. Without the numbers, every box on the diagram is a guess.
The implication is sharp: the candidate or engineer who skips estimation cannot defend their architecture. They can describe it, they can implement it, but the moment somebody asks "why" they have nothing. Numbers are the answer. Practise them on every system you encounter, even when no one is asking.
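To make the habit concrete, here is what a napkin calculation looks like in plain C#. Every number below is illustrative, not taken from any chapter:

// A hypothetical back-of-envelope estimate. All numbers are illustrative.
const double users         = 2_000_000; // daily active users
const double readsPerUser  = 50;        // feed reads per user per day
const double writesPerUser = 2;         // posts per user per day
const double secondsPerDay = 86_400;

double avgReadQps  = users * readsPerUser  / secondsPerDay; // ~1,160 RPS
double peakReadQps = avgReadQps * 3;                        // rule of thumb: peak ~ 3x average
double writeQps    = users * writesPerUser / secondsPerDay; // ~46 RPS

Console.WriteLine($"reads: avg {avgReadQps:F0}, peak {peakReadQps:F0} RPS; writes: {writeQps:F0} RPS");
// ~3,500 read RPS at peak is the number that justifies a cache;
// ~46 write RPS comfortably fits a single Postgres node.

Ten minutes of this arithmetic is what lets you answer "why Redis?" with a number instead of a shrug.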
Lesson 2 — what is the right default stack for a .NET service?
After twenty-six chapters, the boring answer is correct most of the time:
- PostgreSQL + EF Core for storage.
- Redis for cache and shared counters.
- RabbitMQ or Azure Service Bus for queues.
- MediatR for in-process pub/sub.
- Polly / Microsoft.Extensions.Http.Resilience for retries and circuit breakers.
- OpenTelemetry for traces, metrics, and logs.
- ASP.NET Core minimal API for HTTP, gRPC for service-to-service inside the cluster.
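Wired up, the defaults fit in a short Program.cs. A minimal sketch, assuming an AppDbContext and connection-string names that are illustrative; exporters are omitted and the queue wiring appears under Lesson 4:

// A minimal sketch of wiring the default stack. AppDbContext and the
// connection-string names are assumptions for illustration.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddDbContext<AppDbContext>(o =>
    o.UseNpgsql(builder.Configuration.GetConnectionString("Postgres")));   // PostgreSQL + EF Core

builder.Services.AddStackExchangeRedisCache(o =>
    o.Configuration = builder.Configuration.GetConnectionString("Redis")); // Redis as IDistributedCache

builder.Services.AddMediatR(cfg =>
    cfg.RegisterServicesFromAssembly(typeof(Program).Assembly));           // in-process pub/sub

builder.Services.AddOpenTelemetry()
    .WithTracing(t => t.AddAspNetCoreInstrumentation())                    // traces
    .WithMetrics(m => m.AddAspNetCoreInstrumentation());                   // metrics

var app = builder.Build();
app.MapGet("/healthz", () => Results.Ok());                                // minimal API endpoint
app.Run();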
Reach for exotic alternatives - Cassandra, Kafka, Elasticsearch - only when the numbers force the choice. The most common failure I see in code reviews is somebody adopting Kafka for a 10 RPS service. The default stack scales much further than people expect.
Lesson 3 — when should the slow work go through a queue?
Three signs the answer is "yes":
- The work takes longer than the user can wait synchronously.
- The work has external side effects that need at-least-once delivery.
- The workload is bursty and the receiver cannot absorb the burst.
If any of these hold, the message queue chapter applies. Pair the queue with idempotent consumers (chapter 10) and you have a service that survives every transient failure mode.
The corollary: synchronous fire-and-forget HTTP calls are almost always wrong. They give you neither retry nor durability nor back-pressure. The number of production incidents caused by "controller calls third-party in the request path" is depressingly high.
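The fix is usually one publish call. A hypothetical sketch with MassTransit (requires using MassTransit; the route and message types are illustrative):

// Move the slow call out of the request path: persist, publish, return 202.
// The broker supplies the retry, durability, and back-pressure that an
// inline HTTP call to the provider never will.
app.MapPost("/orders", async (OrderRequest req, IPublishEndpoint publisher) =>
{
    var orderId = Guid.NewGuid();
    await publisher.Publish(new OrderPlaced(orderId, req.Amount));
    return Results.Accepted($"/orders/{orderId}"); // 202: side effects happen asynchronously
});

// Illustrative message and request types:
public record OrderRequest(decimal Amount);
public record OrderPlaced(Guid OrderId, decimal Amount);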
Lesson 4 — how do I design for partial failure?
Every component in your architecture will fail at some point. The question is which of the four reliability patterns applies:
// The four resilience defaults for a .NET service:
builder.Services.AddHttpClient<IExternalApi>()
    .AddStandardResilienceHandler(); // Polly: retry + circuit breaker + timeout

builder.Services.AddMassTransit(x =>
{
    x.AddSagaStateMachine<OrderSaga, OrderSagaState>()
        .EntityFrameworkRepository(r => r.ExistingDbContext<AppDbContext>()); // saga persistence
    x.AddEntityFrameworkOutbox<AppDbContext>(o => o.UseBusOutbox());          // outbox: publish only after commit
    x.UsingRabbitMq((ctx, cfg) => cfg.ConfigureEndpoints(ctx));
});

// Plus: idempotency middleware on every state-changing endpoint (chapter 10),
// and OpenTelemetry to see what is happening (chapter 13).
A dozen lines of registration, four production-ready patterns. Most .NET services that go down at 3 AM are missing one of these. The fix is rarely heroic engineering; it is the boring discipline of wiring the patterns from day one.
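The idempotency piece can be as small as a key check in front of the handler. A hypothetical sketch with IDistributedCache - the route, header handling, and ProcessPaymentAsync are illustrative, and chapter 10 covers the full pattern:

// Hypothetical idempotency sketch: replay the stored response for a repeated
// Idempotency-Key instead of repeating the side effect.
app.MapPost("/payments", async (PaymentRequest req, IDistributedCache cache, HttpContext http) =>
{
    var key = http.Request.Headers["Idempotency-Key"].ToString();
    if (string.IsNullOrEmpty(key))
        return Results.BadRequest("Idempotency-Key header required");

    var cached = await cache.GetStringAsync($"idem:{key}");
    if (cached is not null)
        return Results.Ok(cached); // retry replay: same result, no second side effect

    var result = await ProcessPaymentAsync(req); // the real state-changing work
    await cache.SetStringAsync($"idem:{key}", result, new DistributedCacheEntryOptions
    {
        AbsoluteExpirationRelativeToNow = TimeSpan.FromHours(24)
    });
    return Results.Ok(result);
    // Caveat: get-then-set races under concurrent retries; production code
    // wants an atomic set-if-absent (e.g. Redis SET NX) instead.
});

// Illustrative stand-ins:
static Task<string> ProcessPaymentAsync(PaymentRequest req) => Task.FromResult("ok");
public record PaymentRequest(decimal Amount);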
Lesson 5 — what is observability really for?
Three things, in priority order:
- Symptom alerts - tell oncall when users are unhappy.
- Trace navigation - explain why one request was slow.
- Capacity planning - the numbers behind your back-of-envelope become real numbers in dashboards.
Without OpenTelemetry, you have hope. With it, you have a method. Every architectural decision in the series is testable against the metrics it produces; if you cannot measure it, you cannot defend it.
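All three uses sit on the same primitives. A minimal sketch using the System.Diagnostics types that OpenTelemetry exports; the meter, counter, and span names are illustrative:

using System.Diagnostics;
using System.Diagnostics.Metrics;

var meter  = new Meter("Shop.Checkout");
var errors = meter.CreateCounter<long>("checkout.failures"); // 1. symptom alerts fire on this
var orders = meter.CreateCounter<long>("checkout.orders");   // 3. capacity planning: real QPS

var source = new ActivitySource("Shop.Checkout");

using (var activity = source.StartActivity("PlaceOrder"))    // 2. trace: why was this request slow?
{
    activity?.SetTag("order.items", 3);
    orders.Add(1);
}
// OpenTelemetry registers the listeners and exporters that turn these
// counters and activities into dashboards and traces.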
How do the five lessons compose into one architecture?
The lessons are not independent - they layer into the shape every production .NET service eventually adopts:
flowchart LR
Estimate[1. Estimate before architect] --> Stack[2. Default stack:<br/>PG + Redis + RabbitMQ]
Stack --> Async[3. Queue the slow work]
Async --> Reliab[4. Idempotency + outbox<br/>+ resilience handlers]
Reliab --> Obs[5. OpenTelemetry<br/>traces + metrics + logs]
Obs --> Iterate[Iterate on numbers]
Iterate --> Estimate
The loop closes because observability feeds the next round of estimation: real metrics replace guesses, the architecture adjusts, the cycle continues. A service that runs this loop for a few years ends up genuinely well-designed - not because of any one heroic choice, but because of the steady discipline.
What is the production .NET architecture checklist?
Cut this out and pin it to your wall:
Before shipping a new service, verify:
[ ] Estimated peak QPS, one-year storage, and p99 latency budget, on paper.
[ ] PostgreSQL with read replica wired in EF Core.
[ ] Redis cache with explicit TTL and IDistributedCache.
[ ] At least one async path through a queue (RabbitMQ / Service Bus).
[ ] Idempotency middleware on every state-changing endpoint.
[ ] Outbox pattern for write + publish operations.
[ ] HttpClient resilience handler on every external call.
[ ] OpenTelemetry tracing + metrics, exported to a backend.
[ ] Rate limiter middleware on public endpoints.
[ ] Auth via cookie or JWT; never store tokens in localStorage.
[ ] Sitemap, observability dashboards, and runbooks for top alerts.
[ ] Load test at 2x estimated peak before launch.
Twelve items. Most production incidents I have debugged trace back to a missing checkbox. The list is the boring part of system design; mastering it is what separates "works on my machine" from "works on Black Friday".
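Several of the boxes are one registration away. The rate limiter, for example, ships with ASP.NET Core; the policy name and limits below are illustrative:

using System.Threading.RateLimiting;

// Fixed-window limiter: at most 100 requests per second per policy.
builder.Services.AddRateLimiter(o =>
    o.AddFixedWindowLimiter("public", w =>
    {
        w.PermitLimit = 100;
        w.Window = TimeSpan.FromSeconds(1);
    }));

var app = builder.Build();
app.UseRateLimiter();
app.MapGet("/api/items", () => Results.Ok())
    .RequireRateLimiting("public"); // apply the policy to public endpoints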
What should you read after this series?
- Martin Kleppmann, Designing Data-Intensive Applications - the canonical sequel. Read it once, then re-read it.
- Google, Site Reliability Engineering (free online) - the operational half of system design. SLO/SLI, error budgets, incident response.
- Michael Nygard, Release It! - the failure modes book. Every pattern in the reliability chapters is in here in more depth.
- highscalability.com - case studies of real production architectures (Twitter, Discord, Stack Overflow). Use them to practise the framework from chapter 24.
- Microsoft Architecture Center (learn.microsoft.com/azure/architecture) - .NET-specific reference architectures with concrete code.
What should you do tomorrow?
Five small habits that turn the series into permanent skill:
- Quote a number whenever you propose a component. "We need Kafka because... 10K events/sec" not "we need Kafka because it's good for events".
- Add observability before features. Wire OpenTelemetry on day one of any new service; the setup from the observability chapter takes about 30 minutes.
- Question the queue. Every time someone proposes a queue, ask "what failure mode does this prevent?". Sometimes the answer is none and a synchronous call is right.
- Review PRs through the system-design lens. "Where is the idempotency on this endpoint?" "What happens if Redis is down?" The questions push the team toward the patterns.
- Re-read one chapter per week for six weeks. Recognition matters more than recall; spaced re-reading beats one heroic pass.
What if my problem doesn't fit any case study?
You will, occasionally, design something that does not match any of the nine. That is fine. The case studies are common shapes, not the only shapes. When your problem is genuinely new, the right move is to describe it well - estimate the load, name the consistency model, identify the failure modes - using the vocabulary from chapter 1. That is how new patterns enter the conversation in the first place.
Where should you go from here?
- Series start: introduction.
- Interview prep: how to answer.
- Pick the case study that matches your current project: URL shortener, news feed, payment, or any of the other six.
Thank you for reading the series. The vocabulary is yours to keep; the architectures are yours to remix. Now go open one of those production codebases and find a pattern you recognise.