How to Answer System Design Interviews — A Framework
A repeatable six-step framework for answering system design interviews in 45 minutes: clarify, estimate, architect, deep-dive, optimise, recap.
Table of contents
- Why does the framework matter more than the answer?
- What is the six-step framework?
- How do I clarify the scope in five minutes?
- How do I estimate the load in five minutes?
- How do I draw the high-level architecture in ten minutes?
- How do I deep-dive on one component in fifteen minutes?
- How do I handle "what if 10x traffic" in five minutes?
- How should I recap in the final five minutes?
- What does practice look like?
- When does the framework not apply?
- Where should you go from here?
The nine case studies in this series are all the same shape: from "design X" to a working architecture in 45 minutes. The shape is a framework, and the framework can be practised. This chapter pulls the case studies into one repeatable interview answer.
Why does the framework matter more than the answer?
Three reasons.
Ambiguity. "Design Twitter" has no right answer. Different interviewers have different ideal architectures. Showing process (asking, estimating, justifying) is the only way to demonstrate seniority because there is no ground truth.
Time. Forty-five minutes goes fast. Without a structure, candidates spend 30 minutes drawing boxes and never reach the deep-dive that distinguishes mid from senior.
Pressure. The interviewer will throw curveballs. A framework gives you somewhere to land when the curveball lands.
What is the six-step framework?
flowchart LR
Clarify[1. Clarify scope<br/>5 min] --> Estimate[2. Estimate load<br/>5 min]
Estimate --> Arch[3. High-level architecture<br/>10 min]
Arch --> Deep[4. Deep-dive on one component<br/>15 min]
Deep --> Scale[5. Scale-out and failure modes<br/>5 min]
Scale --> Recap[6. Recap<br/>5 min]
Six steps, 45 minutes in total. The budgets fill the whole slot, so keep each step tight to leave room for questions. Most candidates skip steps 1 and 2, and that is where marks are most easily lost.
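The time budget can be turned into clock checkpoints to glance at during the interview. This is a minimal sketch: the step names and minutes are copied from the flowchart, while the `checkpoints` helper is a hypothetical name introduced here for illustration.

```python
# Step names and minute budgets copied from the flowchart above.
STEPS = [
    ("Clarify scope", 5),
    ("Estimate load", 5),
    ("High-level architecture", 10),
    ("Deep-dive on one component", 15),
    ("Scale-out and failure modes", 5),
    ("Recap", 5),
]


def checkpoints(steps):
    """Return (step name, minute by which that step should be finished)."""
    elapsed = 0
    result = []
    for name, minutes in steps:
        elapsed += minutes
        result.append((name, elapsed))
    return result


for name, done_by in checkpoints(STEPS):
    print(f"{done_by:>2} min  {name}")
```

If you are still drawing boxes when the 20-minute checkpoint passes, the deep-dive is already losing time.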
How do I clarify the scope in five minutes?
Three questions to ask, every time:
- Who are the users and what is the geography? Drives traffic patterns and the multi-region question.
- What features are in scope? "Twitter" includes timeline, posting, search, DMs, ads. Pick three; defer the rest.
- What are the success criteria? Latency p99 target, availability SLO, consistency requirements. The chapter 1 vocabulary gives you the words.
A good interviewer will already have written down preferred answers; your asking the question lets them point you at them.
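An availability SLO becomes easier to discuss once it is converted into a downtime budget. The sketch below shows that arithmetic; the SLO values in the loop are illustrative assumptions, not targets taken from this series, and `downtime_budget_minutes` is a hypothetical helper name.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes per year a service may be down and still meet the SLO."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Assumed example SLOs; ask the interviewer for the real target.
for slo in (99.0, 99.9, 99.99):
    print(f"{slo}% availability -> {downtime_budget_minutes(slo):.0f} min/year of downtime")
```

Each extra nine cuts the budget by a factor of ten, which is why the availability answer changes how much redundancy the architecture needs.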
How do I estimate the load in five minutes?
Use the constants from chapter 2: roughly 100K seconds per day (86,400 rounded up for easy division), 1 KB per text row, and a 5x peak factor. Put four numbers on the board:
Numbers I will reference for the rest of the interview:
- DAU ~100M
- Peak QPS ~50K (read), ~5K (write) after 5x factor
- Storage 1 yr ~36 TB at 1 KB per write
- Latency p99 < 100 ms (interactive feature)
Write them in a column the interviewer can see. Reference them in the rest of the discussion - "5K writes/s fits a single Postgres node, so we don't need to shard yet". This is the move that distinguishes structured candidates from improvisers.
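The arithmetic behind those numbers can be sketched as follows. The three constants come from the text above; the per-user activity (10 reads and 1 write per user per day) is an assumption chosen here so that the results match the listed figures, and `estimate` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 100_000  # rounded constant from the text (exact value is 86,400)
BYTES_PER_ROW = 1_000      # 1 KB per text row
PEAK_FACTOR = 5            # peak traffic relative to the daily average


def estimate(dau, reads_per_user_per_day, writes_per_user_per_day):
    """Back-of-envelope peak QPS and one-year storage for a given user base."""
    avg_read_qps = dau * reads_per_user_per_day / SECONDS_PER_DAY
    avg_write_qps = dau * writes_per_user_per_day / SECONDS_PER_DAY
    storage_bytes_per_year = dau * writes_per_user_per_day * BYTES_PER_ROW * 365
    return {
        "peak_read_qps": avg_read_qps * PEAK_FACTOR,
        "peak_write_qps": avg_write_qps * PEAK_FACTOR,
        "storage_tb_per_year": storage_bytes_per_year / 1e12,
    }


# ASSUMPTION: 10 reads and 1 write per user per day.
numbers = estimate(dau=100_000_000, reads_per_user_per_day=10, writes_per_user_per_day=1)
for name, value in numbers.items():
    print(f"{name}: {value:,.1f}")
```

With these inputs the sketch gives 50K peak reads/s, 5K peak writes/s and about 36.5 TB per year, in line with the column above.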
How do I draw the high-level architecture in ten minutes?
Start with the simplest single-node version, then add only the boxes the load numbers force. The pattern is repeatable across the case studies:
flowchart LR
Client --> LB[Load Balancer]
LB --> App[App service]
App --> Cache[(Cache)]
App --> DB[(Primary DB)]
App --> Q[(Queue)]
Q --> Worker[Async worker]
Worker --> External[External services]
App --> Obs[(Observability)]
These few boxes are enough for almost any case study. Justify each by referring to a specific number from your estimate. The cache is there because of the read amplification (90% hit rate); the queue is there because of the long-running side effect; the DB is Postgres because the write QPS fits one node.
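Justifying the cache with a number can be as simple as the calculation below. It is a minimal sketch that uses the 50K peak read QPS from the estimate and the 90% hit rate quoted above; `db_read_qps` is a hypothetical helper name.

```python
def db_read_qps(peak_read_qps: float, cache_hit_rate: float) -> float:
    """Reads per second that miss the cache and fall through to the database."""
    return peak_read_qps * (1 - cache_hit_rate)


with_cache = db_read_qps(peak_read_qps=50_000, cache_hit_rate=0.9)
without_cache = db_read_qps(peak_read_qps=50_000, cache_hit_rate=0.0)
print(f"DB reads/s with cache:    {with_cache:,.0f}")
print(f"DB reads/s without cache: {without_cache:,.0f}")
```

Saying "the cache takes the database from 50K to roughly 5K reads per second" is the kind of sentence that ties a box to a number.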
How do I deep-dive on one component in fifteen minutes?
The interviewer will point at a box and say "tell me more about that". This is where 60% of the score lives. Three layers to walk through:
- Schema or interface - what does the data look like, what does the API look like.
- Algorithm or pattern - how does cache invalidation work, how does the queue retry, how does the saga compensate. The reliability chapters are the source for these.
- Failure mode - what happens when this component dies, how do you detect it, how do you recover.
Practise the deep-dive on every case study in this series. The URL shortener (chapter 15) deep-dives the cache; the news feed (chapter 17) deep-dives the fan-out worker; the payment system deep-dives the saga.
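As an illustration of the three layers on the cache box, here is a minimal sketch of cache-aside reads with invalidate-on-write. It is one common pattern, not necessarily the one the case studies use; plain dictionaries stand in for the cache and the primary DB, and the class and method names are hypothetical.

```python
class CacheAsideStore:
    """Cache-aside reads with invalidate-on-write, using dicts as stand-ins."""

    def __init__(self):
        self.db = {}       # stand-in for the primary DB (source of truth)
        self.cache = {}    # stand-in for the cache
        self.db_reads = 0  # counts reads that fell through to the DB

    # Layer 1, interface: get and put by key.
    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache on a miss
        return value

    # Layer 2, algorithm: write the DB first, then invalidate the cached copy.
    def put(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)


store = CacheAsideStore()
store.put("abc", "https://example.com/v1")
store.get("abc")                      # miss: reads the DB and fills the cache
store.get("abc")                      # hit: served from the cache
store.put("abc", "https://example.com/v2")
print(store.get("abc"))               # the stale entry was invalidated

# Layer 3, failure mode: losing the cache costs latency, not correctness.
store.cache.clear()
print(store.get("abc"), store.db_reads)
```

In an interview the follow-up questions usually target the gap between the DB write and the invalidation, so be ready to say what a reader can observe in that window.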
How do I handle "what if 10x traffic" in five minutes?
Take the numbers from step 2 and multiply. For each box in your architecture, say what changes:
// The "scale to 10x" mental checklist:
// - App tier: stateless, add replicas. NO change.
// - Cache: scale Redis cluster, partition keys by hash. NO change.
// - DB: 5K -> 50K writes/s exceeds single Postgres. ADD: sharding by user_id.
// - Queue: 1K -> 10K msg/s still fits RabbitMQ. NO change.
// - Worker: parallelise via consumer groups. NO change.
// - Observability: cardinality may explode. ADD: drop high-cardinality labels.
Knowing where the bottleneck is matters more than knowing how to fix it. The interviewer wants to see you can reason about which component breaks first.
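The checklist above can be reduced to a small comparison of scaled load against capacity. In the sketch below the current loads come from the checklist, but the capacity ceilings are illustrative assumptions rather than benchmarks, picked only to be consistent with the checklist's verdicts; `over_capacity` is a hypothetical helper name.

```python
# (component, current load per second, ASSUMED ceiling per second)
# The ceilings are placeholders, not measured limits of Postgres or RabbitMQ.
COMPONENTS = [
    ("DB writes", 5_000, 10_000),
    ("Queue messages", 1_000, 20_000),
]


def over_capacity(components, factor):
    """Return (name, load/ceiling ratio) for overloaded components, worst first."""
    scaled = [(name, load * factor, limit) for name, load, limit in components]
    broken = [(name, load / limit) for name, load, limit in scaled if load > limit]
    return sorted(broken, key=lambda item: item[1], reverse=True)


for name, ratio in over_capacity(COMPONENTS, factor=10):
    print(f"{name}: {ratio:.1f}x over the assumed ceiling")
```

Under these assumptions only the database write path breaks at 10x, which matches the checklist: shard the DB and leave the queue alone.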
How should I recap in the final five minutes?
Three sentences, on the board:
- "Here is what we are building, with these four numbers."
- "Here is the architecture, with the cache/queue/DB justified."
- "Here is what changes at 10x scale and the main failure modes."
Then ask "what would you like me to dig into deeper?" - this is the closing move that often turns a borderline interview into a hire.
What does practice look like?
Pick one case study from this series per practice session. Set a 45-minute timer. Keep the chapter covered and look at it only for the estimation numbers. Run through the six steps on a whiteboard (or paper), then compare your architecture to the chapter's. The gap narrows in two or three sessions.
The nine case studies cover the standard interview corpus:
- Read-heavy CRUD: URL shortener
- Write-heavy fan-out: news feed
- Realtime: chat
- Multi-channel: notification
- Large objects: file upload
- Search adjacency: typeahead
- Money / strict consistency: payment
- Streaming: analytics
- Algorithmic: rate limiter
If you can answer all nine in 45 minutes each, you can answer almost any system design interview question.
When does the framework not apply?
Two situations.
Whiteboard coding interviews. Different shape entirely - data structures and algorithms. The framework here is for system design specifically.
Domain-specific deep dives ("design our specific feature X"). The interviewer expects you to learn their domain, not run the generic framework. Ask many clarifying questions; rely on the fundamentals from chapter 3.
Where should you go from here?
Last chapter: conclusion - five lessons from the entire series and a checklist you can carry into your next architecture review or interview.