Scale Vocabulary: Throughput, Latency, QPS, SLA
The scalability words every system design discussion uses: throughput vs latency, vertical vs horizontal, QPS, p50/p99, SLA vs SLO vs SLI. Defined for .NET engineers.
Table of contents
- What does throughput vs latency mean - and why do they trade off?
- Why do we quote p50, p95, p99 instead of an average?
- What is the difference between vertical and horizontal scaling?
- What numbers should I memorise to think in QPS?
- How do SLI, SLO, and SLA relate to one another?
- What does this vocabulary look like in a real .NET service?
- When does this vocabulary stop being useful?
- Where should you go from here?
System design discussions move quickly because everyone uses the same six words: throughput, latency, vertical scaling, horizontal scaling, QPS, percentile. If those words mean different things to two people in the room, the discussion is theatre. This chapter defines them once, in the way the rest of the series will use them, and grounds each in a .NET tool you can actually measure with.
What does throughput vs latency mean - and why do they trade off?
Throughput is how many requests the system finishes per second (req/s, often abbreviated QPS for "queries per second"). Latency is how long one request waits before getting a response (ms).
They are not the same axis. A service can have high throughput and high latency (a batch job that returns 10,000 results in 5 seconds), or low throughput and low latency (a single-user dev box). The trade-off shows up under load: as you push QPS toward the system's ceiling, queues form, and queueing increases latency. Little's Law makes the relationship explicit:
average concurrent requests = throughput * average latency
If you serve 1000 req/s with average latency 50 ms, you have 50 concurrent in-flight requests. Push to 100 ms latency and you have 100 concurrent - which means more thread/connection pool pressure, more GC, more contention, and more latency. The feedback loop is why services collapse non-linearly under sustained overload.
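The arithmetic is easy to sanity-check in a scratch program. A minimal sketch, using the numbers from the paragraph above rather than anything measured:
// Little's Law: N = X * W
// N = average concurrent requests, X = throughput (req/s), W = average latency (s)
double throughput = 1000;   // req/s
double latency = 0.050;     // 50 ms average
Console.WriteLine($"In-flight: {throughput * latency}");   // 50

latency = 0.100;            // latency doubles under load...
Console.WriteLine($"In-flight: {throughput * latency}");   // ...and so does concurrency: 100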
graph LR
QPS[Throughput up] --> Q[Queue grows]
Q --> Lat[Latency up]
Lat --> Conc[Concurrent up]
Conc --> Press[Resource pressure]
Press --> Lat
style Lat fill:#fbb,stroke:#c00
Why do we quote p50, p95, p99 instead of an average?
Latency distributions are not normal - they are heavy-tailed. A typical web service might look like 30 ms for 95 of 100 requests, 200 ms for 4, and 2,000 ms for 1. The average is 56.5 ms; the user waiting 2 seconds does not feel "average".
Percentiles slice the distribution honestly. p50 (the median) - half of users wait this long or less. p95 - one in twenty waits longer. p99 - one in a hundred. p99.9 - one in a thousand, the number that pages the on-call engineer.
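To see the gap between average and percentile on raw data, here is a minimal sketch (nearest-rank percentile over an in-memory sample; real systems use histograms so they never have to store every observation):
// The distribution from above: 95 requests at 30 ms, 4 at 200 ms, 1 at 2,000 ms.
var samples = Enumerable.Repeat(30.0, 95)
    .Concat(Enumerable.Repeat(200.0, 4))
    .Append(2000.0)
    .OrderBy(x => x)
    .ToList();

// Nearest-rank percentile: the value at rank ceil(p/100 * n) in the sorted sample.
static double Percentile(IReadOnlyList<double> sorted, double p)
{
    int rank = (int)Math.Ceiling(p / 100.0 * sorted.Count) - 1;
    return sorted[Math.Clamp(rank, 0, sorted.Count - 1)];
}

Console.WriteLine($"avg {samples.Average():F1} ms");     // 56.5
Console.WriteLine($"p50 {Percentile(samples, 50)} ms");  // 30
Console.WriteLine($"p99 {Percentile(samples, 99)} ms");  // 200
Console.WriteLine($"max {samples[^1]} ms");              // 2000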
Two services with the same average can have wildly different p99. Pick the percentile that matches your user pain - p95 for casual UX, p99 for transactional flows, p99.9 for payment APIs. ASP.NET Core's OpenTelemetry instrumentation emits these directly:
builder.Services.AddOpenTelemetry()
.WithMetrics(b => b
.AddAspNetCoreInstrumentation() // http.server.request.duration histogram
.AddRuntimeInstrumentation() // GC, thread pool
.AddPrometheusExporter());
Then in Grafana you query histogram_quantile(0.99, ...) and you have your number. Chapter 13 covers the wiring; this chapter just sets the expectation that every latency claim must name its percentile.
What is the difference between vertical and horizontal scaling?
Vertical scaling means making one box bigger - more CPU, more RAM, faster disk. The cap is that single-box prices grow super-linearly: the next tier of CPU often costs 4x for 2x performance, and at some point you are paying for hardware that does not exist on the cloud provider's menu.
Horizontal scaling means adding more boxes that handle parts of the load. The cap there is coordination cost: the boxes need to agree on data, route traffic to each other, fail over when one dies. A stateless web tier scales horizontally almost for free; a stateful database does not.
The pragmatic ladder for a .NET service: scale up the web tier first (go from 4 to 16 vCPUs on the App Service plan); when one box is no longer enough, add replicas behind a load balancer; when one database is the bottleneck, add a read replica; only when that is full do you look at sharding. Each step is cheaper than the next; jump steps only when you have evidence.
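The decision at each rung is arithmetic, not taste. A hypothetical capacity check - every number here is an assumption you would replace with your own load-test results:
double targetQps = 5000;      // expected peak traffic (assumed)
double perInstanceQps = 900;  // measured ceiling of one instance (assumed)
double headroom = 0.70;       // plan to run at no more than 70% of ceiling

int replicas = (int)Math.Ceiling(targetQps / (perInstanceQps * headroom));
Console.WriteLine($"Replicas behind the load balancer: {replicas}");  // 8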
What numbers should I memorise to think in QPS?
Useful round numbers for back-of-envelope:
- 1 RPS = 86,400 req/day - so a 10 RPS service does ~864K requests/day, a 100 RPS service does ~8.6M.
- 1 KB row * 1M rows = 1 GB - so a million-user table with a few fields per row is ~1 GB, well inside any free database tier.
- L1 cache ~1 ns, RAM ~100 ns, SSD ~100 µs, network round-trip ~500 µs LAN / ~50 ms internet - the gap of nearly six orders of magnitude between RAM and an internet round-trip is why caches exist.
- Single Postgres node ~5K-20K writes/s - depending on row size, durability, fsync settings. If your design needs 100K writes/s you are in chapter 5 territory.
Chapter 2 (back-of-envelope) practices these numbers in earnest.
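In the meantime, the first two bullets as a scratch calculation - nothing here is a library, just the round numbers from the list:
double rps = 100;
Console.WriteLine($"{rps * 86_400 / 1_000_000:F1}M req/day");  // 8.6M

long rows = 1_000_000;   // a million-user table
long rowBytes = 1_024;   // ~1 KB per row
Console.WriteLine($"{rows * rowBytes / 1_000_000_000.0:F1} GB");  // 1.0 GB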
How do SLI, SLO, and SLA relate to one another?
Three nested circles, smallest to largest tolerance:
- SLI (indicator) - the metric you actually measure. "Success rate of POST /checkout, 5-minute window". One number, sampled continuously.
- SLO (objective) - your internal target on that SLI. "99.9% over 30 days". This is what the team optimises for.
- SLA (agreement) - the external promise, usually with a refund attached. "Monthly availability >= 99.5% or we credit the customer 10%". Always weaker than the SLO so you have margin.
The arithmetic: 99.9% over 30 days = 43 minutes of allowed downtime. 99% = 7.2 hours. The math gets tight quickly. A two-hour incident twice a quarter blows a 99.9% SLO. The discipline of writing the SLO down forces honest conversations about which features deserve the extra nines and which can degrade.
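The downtime budget is a one-liner worth keeping in a scratchpad - a sketch, not how real error-budget tooling works (that burns down against the SLI, not the calendar):
// Allowed downtime = window * (1 - SLO)
static TimeSpan ErrorBudget(double slo, TimeSpan window) => window * (1 - slo);

Console.WriteLine(ErrorBudget(0.999, TimeSpan.FromDays(30)));  // 00:43:12 - 43 minutes
Console.WriteLine(ErrorBudget(0.99,  TimeSpan.FromDays(30)));  // 07:12:00 - 7.2 hours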
What does this vocabulary look like in a real .NET service?
A simple ASP.NET Core service exposes the whole dictionary in one glance through dotnet-counters:
// Run from the command line on a deployed instance:
// dotnet-counters monitor --process-id <pid> --counters \
//     System.Runtime,Microsoft.AspNetCore.Hosting,Microsoft.AspNetCore.Server.Kestrel
// You will see live, per-second:
// - requests-per-second (throughput / QPS)
// - current-requests (Little's Law N)
// - cpu-usage, working-set, time-in-gc (saturation signals)
// - current-connections, request-queue-length (Kestrel back-pressure)
// On .NET 8+ the Microsoft.AspNetCore.Hosting meter adds the
// http.server.request.duration histogram (the latency distribution).
Every chapter in this series quotes one of these signals. The chapter on rate limiting caps QPS; the chapter on observability exports them to Prometheus; the chapter on circuit breakers reacts when they cross thresholds. Knowing the words is what lets you follow what those chapters do with them.
When does this vocabulary stop being useful?
Two cases.
First, when you are still in the prototype phase: precision about p99 is a waste while the schema is changing weekly. Use averages and "feels fast / feels slow" until traffic is real.
Second, for batch / ETL workloads where total job time matters more than per-request latency. Quoting p99 for a nightly Spark job is the wrong axis - the right metrics are job duration and cost per row. Most of the .NET services in this series are interactive request/response, so percentile thinking applies; chapter 23 (analytics events pipeline) is the one place where batch metrics dominate.
Where should you go from here?
Next chapter: back-of-envelope arithmetic, which turns this dictionary into numbers you can put on a whiteboard. After that, CAP and consistency introduces the second vocabulary that underpins every database choice in the series.
Frequently asked questions
Why p99 instead of average latency?
Latency is heavy-tailed, so the average hides the slow requests users actually notice. Quote the percentile that matches the user pain: p95 for casual UX, p99 for transactional flows, p99.9 for payment APIs.
When do I scale up vs scale out?
Scale up while a bigger box is the cheap option; scale out when single-box prices grow super-linearly or the box you need no longer exists. Stateless tiers scale out almost for free; stateful ones pay a coordination cost.
What is the difference between SLA, SLO, SLI?
The SLI is the metric you measure, the SLO is your internal target on that metric, and the SLA is the external promise with a penalty attached - always set weaker than the SLO so you have margin.
How does this vocabulary apply to .NET specifically?
dotnet-counters shows GC pauses that inflate tail latency. ASP.NET Core's rate limiter (chapter 14) caps QPS. EF Core's command interceptor surfaces query latency. Every chapter quotes these numbers - this chapter is the dictionary.