Back-of-Envelope Math for System Design
How to estimate QPS, storage, bandwidth, and latency budget on a whiteboard. The numbers every system design interview and capacity-planning exercise reuses.
Table of contents
- When does back-of-envelope arithmetic actually save you?
- What four numbers should every estimate produce?
- What round constants should I memorise?
- How does a 30-second estimate flow on a whiteboard?
- What does this look like in .NET capacity planning?
- Where do back-of-envelope estimates fail?
- When should you skip back-of-envelope and just measure?
- Where should you go from here?
A senior engineer in a system design interview is given the prompt "design Twitter". They reach for a marker, write a column of four numbers - 100M DAU, 50K QPS, 200 TB/year, p99 < 100 ms - and the rest of the conversation flows from there. That column is back-of-envelope math, and the discipline behind it is what this chapter teaches.
When does back-of-envelope arithmetic actually save you?
Three situations.
First, interview opening. The interviewer expects you to estimate load before drawing boxes. Skipping the estimate signals that you do not know which architecture is overkill. Two minutes of math earn the right to propose a "boring" Postgres-only design, or to defend a sharded Cassandra cluster.
Second, capacity planning at work. Before you spin up the production database tier, you owe the team an answer to "how big?". Overprovisioning wastes money; underprovisioning means a 3 AM incident. The math fits in a Slack thread.
Third, architecture review. Someone proposes Kafka for an event stream. Is it warranted? Estimate the QPS - if it is 10/s, no. If it is 100K/s, yes. The estimate ends the debate.
What four numbers should every estimate produce?
Every back-of-envelope exercise lands on the same four numbers, because they map to the same four cost lines on a cloud bill:
- QPS at peak (writes vs reads broken out) - drives compute sizing.
- Storage in TB at one year (and growth rate) - drives database tier and sharding decision.
- Bandwidth in GB/s in and out at peak - drives load balancer and egress cost.
- Latency budget in ms broken down per hop - drives architecture choice (cache vs DB, sync vs queue).
The QPS and storage numbers are the most important; bandwidth and latency confirm the design afterwards.
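If it helps to see the deliverable as a shape, here is a minimal sketch; the record and its field names are illustrative, not from any framework:
// The four deliverables of an estimate. Illustrative type, not a library API.
public readonly record struct EnvelopeResult(
    double PeakReadQps,        // compute sizing (reads broken out)
    double PeakWriteQps,       // compute sizing (writes broken out)
    double StorageTbAtOneYear, // database tier and sharding decision
    double PeakEgressGbPerSec, // load balancer and egress cost
    double LatencyBudgetMs);   // cache vs DB, sync vs queue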
What round constants should I memorise?
Memorise these once and never compute them again:
TIME
- 1 day ~ 100,000 seconds (actually 86,400; rounded up for headroom)
- 1 month ~ 30 days ~ 2.5M seconds
- 1 year ~ 30M seconds
DATA
- 1 small text row ~ 1 KB (tweet, comment, log line)
- 1 thumbnail image ~ 50 KB
- 1 photo ~ 500 KB
- 1 short video ~ 10 MB
- 1 row in a users table ~ 200 bytes
NETWORK
- LAN round-trip ~ 0.5 ms
- Cross-AZ round-trip ~ 1-2 ms
- Cross-region round-trip ~ 50-150 ms
- Internet round-trip ~ 10-200 ms
DISK / MEMORY
- RAM read ~ 100 ns
- SSD random read ~ 100 µs
- SSD sequential read ~ 300 MB/s
- HDD random read ~ 10 ms
- HDD sequential read ~ 100 MB/s
CPU SCALES
- 1 modern core ~ 100K-1M simple ops/sec
- 1 ASP.NET Core box ~ 10K simple QPS, ~1K when each request runs an EF Core query
- 1 Postgres node ~ 5K-20K writes/sec, 30K-100K reads/sec
- 1 Redis node ~ 100K-1M ops/sec
- 1 Kafka broker ~ 1M msg/sec at default settings
These are generous round numbers; real measurements are usually within 2x. The 100K seconds/day trick is the most useful: 1000 RPS becomes 100M req/day in your head with no calculator.
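The crib sheet translates directly into code you can reuse in sizing spreadsheets or tests. A minimal sketch; the class and member names are my own, not from any library:
// The crib-sheet constants as C#. Names are illustrative.
public static class Envelope
{
    // TIME (rounded for headroom)
    public const long SecondsPerDay = 100_000;    // actually 86,400
    public const long SecondsPerMonth = 2_500_000;
    public const long SecondsPerYear = 30_000_000;
    // DATA (bytes)
    public const long SmallRowBytes = 1_000;      // tweet, comment, log line
    public const long ThumbnailBytes = 50_000;
    public const long PhotoBytes = 500_000;
    public const long ShortVideoBytes = 10_000_000;
    public const long UserRowBytes = 200;
    // The 100K seconds/day trick: RPS to requests/day in one multiply.
    public static long RequestsPerDay(long rps) => rps * SecondsPerDay;
}
Envelope.RequestsPerDay(1_000) returns 100,000,000 - the same 1000 RPS to 100M req/day conversion done in your head.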
How does a 30-second estimate flow on a whiteboard?
Take Twitter as the canonical example. The interviewer says "100M daily active users, design the timeline". The mental flow:
flowchart LR
DAU[100M DAU] --> QPS[Peak QPS<br/>= 100M / 100K s<br/>* 5 peak factor<br/>= 5K req/s reads<br/>= 500 req/s writes]
QPS --> Storage[Storage<br/>= 500 writes/s<br/>* 100K s/day<br/>* 1 KB<br/>* 365 days<br/>= 18 TB/year]
Storage --> BW[Bandwidth<br/>= 5K read/s<br/>* 50 KB tweet+meta<br/>= 250 MB/s out]
BW --> Lat[Latency<br/>p99 100 ms<br/>= 50ms cache<br/>+ 30ms net<br/>+ 20ms render]
That is the 30-second budget. The rest of the conversation - which database, sharding, caching - all reference back to those four numbers. If somebody proposes a single Postgres node, you can say "500 writes/s fits one box but 18 TB/year forces partitioning by year" and the design follows.
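The same flow as straight-line C#, so the arithmetic can be checked; every input is an assumption carried over from the diagram, not a measurement:
// The Twitter whiteboard flow above as checkable arithmetic.
const long Dau = 100_000_000;            // 100M daily active users
const long SecondsPerDay = 100_000;      // rounded 86,400
const long PeakFactor = 5;               // diurnal peak over average
const long ReadsPerWrite = 10;           // assumed read:write ratio
const long TweetBytes = 1_000;           // 1 KB per stored tweet
const long TweetPlusMetaBytes = 50_000;  // rendered tweet + metadata

long peakReadsPerSec = Dau / SecondsPerDay * PeakFactor;      // 5,000
long peakWritesPerSec = peakReadsPerSec / ReadsPerWrite;      // 500
// Generous: treats the peak write rate as sustained all day.
long writesPerDay = peakWritesPerSec * SecondsPerDay;         // 50M
long storageTbPerYear = writesPerDay * 365 * TweetBytes
                        / 1_000_000_000_000;                  // ~18 TB
long egressMbPerSec = peakReadsPerSec * TweetPlusMetaBytes
                      / 1_000_000;                            // 250 MB/s

Console.WriteLine($"{peakReadsPerSec} reads/s, {peakWritesPerSec} writes/s, " +
                  $"{storageTbPerYear} TB/year, {egressMbPerSec} MB/s out");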
What does this look like in .NET capacity planning?
Translate to dollars by working backward from the numbers above. Suppose you are sizing the URL shortener that gets a full design in a later chapter:
// Estimated workload, captured as constants for sanity-check tests:
public static class CapacityEstimate
{
    public const int DailyShortens = 1_000_000;     // 1M new URLs/day
    public const int DailyRedirects = 100_000_000;  // 100M reads/day (100:1 read/write)
    public const int PeakRedirectsPerSec = DailyRedirects / 100_000 * 5; // 5K req/s peak
    public const int AvgUrlBytes = 200;             // short + long + meta
    public const long StorageOneYearGb =
        (long)DailyShortens * 365 * AvgUrlBytes / 1_000_000_000; // 73 GB
    public const int CacheHitRatePct = 90;          // hot 1% of URLs serve 90% of reads
    public const int DbReadsPerSec =
        PeakRedirectsPerSec * (100 - CacheHitRatePct) / 100; // 500 RPS
}
Now the architecture choices have evidence:
- 5K reads/s peak - one ASP.NET Core box can handle it; two for HA.
- 500 DB reads/s after 90% cache hit rate - one Postgres node with no special tuning.
- 73 GB/year storage - well inside any free database tier.
- No sharding needed for at least three years of growth.
The whole design is one Postgres + one Redis + two stateless web nodes, and you can defend it numerically. Without the estimate, the same exercise produces an over-engineered Cassandra-and-Kafka nightmare.
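The "sanity-check tests" promised in the code comment can be as small as assertions tying the derived constants to the architecture bullets; a sketch using xUnit (my choice of framework, not mandated by the text):
using Xunit;

public class CapacityEstimateTests
{
    // If an input constant changes, these fail and force the
    // one-Postgres-one-Redis decision to be re-argued.
    [Fact]
    public void PeakRedirectsFitOneWebBox() =>
        Assert.True(CapacityEstimate.PeakRedirectsPerSec <= 10_000); // ~10K QPS per ASP.NET Core box

    [Fact]
    public void DbReadsFitOnePostgresNode() =>
        Assert.True(CapacityEstimate.DbReadsPerSec <= 30_000); // low end of the Postgres read range

    [Fact]
    public void StorageStaysInsideFreeTier() =>
        Assert.True(CapacityEstimate.StorageOneYearGb < 100); // 73 GB expected
}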
Where do back-of-envelope estimates fail?
Two failure modes.
First, bursty workloads. The 5x peak factor in the Twitter example assumes diurnal traffic. If your workload spikes 100x for a flash sale, the average QPS is meaningless - design for the peak. Chapter 14 (rate limiting) shows how to cap the burst when you cannot scale to absorb it.
Second, storage that grows faster than the write rate. Photos, videos, ML embeddings - the per-row size is large, so you measure storage in bytes per second, not rows per second. Update the constant and the estimate works again, but a "small" service with 1M users uploading one 5 MB image each is still a 5 TB problem.
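The same skeleton handles the media-heavy case once the unit flips to bytes; a sketch with illustrative inputs (the upload rate is an assumption, not from the text):
// Failure mode 2: storage driven by payload size, not row count.
const long Users = 1_000_000;
const long ImageBytes = 5_000_000;        // 5 MB per upload
const double UploadsPerUserPerDay = 0.1;  // one upload every 10 days (assumed)
const long SecondsPerYear = 30_000_000;

long backlogTb = Users * ImageBytes / 1_000_000_000_000;  // 5 TB if every user uploads once
double ingestBytesPerSec = Users * UploadsPerUserPerDay * ImageBytes / 100_000; // ~5 MB/s
double firstYearTb = ingestBytesPerSec * SecondsPerYear / 1e12;                 // ~150 TB

Console.WriteLine($"backlog {backlogTb} TB, ingest {ingestBytesPerSec / 1e6:F0} MB/s, " +
                  $"year one {firstYearTb:F0} TB");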
When should you skip back-of-envelope and just measure?
When the system already exists and you are tuning it. Real metrics from chapter 13 (observability) beat estimates by an order of magnitude. The estimate's job is to design the first version - to rule out clearly wrong architectures - not to compete with telemetry. Once the system is in production, the estimate becomes a sanity check on the dashboards, not the source of truth.
Where should you go from here?
Next chapter: CAP and consistency - the second vocabulary, the one that decides whether you can use one database or need to think about quorums. After that you have the full toolkit to start choosing concrete .NET building blocks.