Foundations · Intermediate · 5 min read

Software Estimation: Three-Point, Reference Class, T-Shirt

How to estimate software work without lying to yourself: three-point estimation, reference class forecasting, T-shirt sizing, and when planning poker is worth the time.

Table of contents
  1. When does estimation actually have to be precise?
  2. What is the cost of bad estimation?
  3. What does the minimal estimation toolkit look like?
  4. How does three-point estimation work in practice?
  5. What is the concrete reference-class log?
  6. What does estimation look like at multi-team scale?
  7. What failure modes does estimation introduce?
  8. When is detailed estimation a waste of time?
  9. Where should you go from here?

The hardest question a tech lead is asked - "how long will this take?" - has no honest answer in single-number form. The honest answer is a range with a confidence level. This chapter teaches four techniques that produce defensible ranges, when to use which, and how to keep the team honest about calibration over time.

When does estimation actually have to be precise?

Three situations. For most others, a rough order of magnitude is fine.

Hard external commitment. A regulatory deadline, a contract with a client penalty, a marketing launch tied to ad spend. The business has to commit money against the estimate, so the estimate needs a confidence interval, not a guess.

Quarterly planning. The org wants to know what gets done this quarter so it can sequence dependencies across teams. Imprecise estimates here cause cross-team thrash.

Build vs buy decisions. "Build it ourselves" vs "license a SaaS" turns on cost. The build estimate has to be defensible.

For everything else - sprint-level work, exploratory spikes, internal tools - "rough size" is enough. Estimation effort beyond that is its own expense.

What is the cost of bad estimation?

Three failure modes that show up later:

Death-march sprints. The project committed to a date based on a single-number estimate that was the optimistic case. When reality arrives, the team works weekends to hit the date. Quality drops, attrition rises.

Over-padding. The team learned from the previous death march and pads every estimate by 3x. Now real estimates are 6 weeks but they say 18; leadership stops trusting numbers.

Decision paralysis. "We can't decide build vs buy until we estimate the build." The estimate takes a month. By then the buy option has changed pricing.

What does the minimal estimation toolkit look like?

Four techniques, increasing in cost and accuracy:

flowchart LR
    TS[T-Shirt sizing<br/>S/M/L/XL<br/>5 minutes] --> RC[Reference class<br/>past 3-5 projects<br/>30 minutes]
    RC --> TP[Three-point<br/>best/likely/worst<br/>1 hour per item]
    TP --> PP[Planning poker<br/>full team<br/>2 hours per backlog]

Use them in stages. T-shirt at the idea stage; reference class at the kickoff; three-point during planning; planning poker when the backlog is being broken into stories. Most teams skip directly to planning poker and miss the upstream techniques that catch order-of-magnitude errors.
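
One cheap way to connect the stages is to pin each T-shirt size to a rough week band, then check later three-point estimates against the band assigned at the idea stage. A minimal sketch in Python - the bands here are illustrative assumptions, not a standard:

# Hypothetical T-shirt-to-weeks bands; calibrate against your own history.
TSHIRT_WEEKS = {
    "S": (0.0, 0.5),            # a couple of days
    "M": (0.5, 1.0),            # up to a week
    "L": (1.0, 2.0),            # up to a sprint
    "XL": (2.0, float("inf")),  # more than a sprint - split it first
}

def sanity_check(size: str, estimate_weeks: float) -> bool:
    """Flag detailed estimates that escape their original T-shirt band."""
    low, high = TSHIRT_WEEKS[size]
    return low <= estimate_weeks <= high

print(sanity_check("L", 1.5))  # True - the detailed estimate agrees
print(sanity_check("M", 3.0))  # False - re-size or split before committing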

How does three-point estimation work in practice?

For each task, ask for three numbers:

B - best case: everything goes as planned, no surprises.
M - most likely: the realistic middle.
W - worst case: the known risks all materialize.

Then compute the PERT estimate and standard deviation:

estimate = (B + 4M + W) / 6
stddev   = (W - B) / 6
range    = estimate ± 2 * stddev   (95% confidence)

Example: a task with B=2 days, M=4 days, W=10 days gives estimate = (2 + 16 + 10) / 6 = 4.67 days, stddev = 1.33 days, 95% range ≈ 2 to 7.3 days. Report the range, not the point.

Sum the estimates across tasks to get the project total. For the spread, stddevs do not add directly: sum the variances (each stddev squared) and take the square root. That gives you the project-level range. The shock for most teams is how wide it is once the W terms compound - which is the point of showing the math to leadership.
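
A minimal sketch of this rollup in Python - the task tuples are illustrative, not from a real backlog:

import math

def pert(best, likely, worst):
    """PERT estimate and standard deviation for one task, in days."""
    estimate = (best + 4 * likely + worst) / 6
    stddev = (worst - best) / 6
    return estimate, stddev

def project_range(tasks):
    """Project-level 95% range: estimates add, variances (stddev
    squared) add, and the project stddev is the root of that sum."""
    total = sum(pert(*t)[0] for t in tasks)
    variance = sum(pert(*t)[1] ** 2 for t in tasks)
    spread = 2 * math.sqrt(variance)
    return total - spread, total + spread

tasks = [(2, 4, 10), (1, 3, 8), (5, 8, 20)]  # (B, M, W) per task, in days
low, high = project_range(tasks)
print(f"95% range: {low:.1f} to {high:.1f} days")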

What is the concrete reference-class log?

Keep a simple yaml file in the team's repo with past project durations, updated at every retrospective:

reference_class:
  - project: "Admin dashboard for billing"
    started: "2026-01-15"
    shipped: "2026-03-10"
    elapsed_weeks: 8
    initial_estimate_weeks: 4
    actual_vs_initial: 2.0x
    pattern: "CRUD admin + 2 reports"
    surprises: "auth provider rate limits; QA env flaky for 1 week"

  - project: "Inventory webhook integration"
    started: "2026-02-20"
    shipped: "2026-04-05"
    elapsed_weeks: 6
    initial_estimate_weeks: 3
    actual_vs_initial: 2.0x
    pattern: "third-party API integration"
    surprises: "vendor SDK undocumented edge cases"

  - project: "Search re-indexing pipeline"
    started: "2026-03-01"
    shipped: "2026-04-10"
    elapsed_weeks: 6
    initial_estimate_weeks: 6
    actual_vs_initial: 1.0x
    pattern: "Elasticsearch migration"
    surprises: "none - had a similar prior project"

Two things to notice. First, actual_vs_initial is consistently ~2x for new patterns and ~1x for repeat patterns. Second, the surprises field captures why - and that becomes the input to next quarter's risk register (chapter 3).
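
To make the log actionable, a short script can turn it into a prior for the next estimate. A minimal sketch, assuming the log above is saved as reference_class.yaml and PyYAML is installed - the substring matching rule and the 2x fallback are illustrative assumptions:

from statistics import median
import yaml  # PyYAML

def forecast_weeks(bottom_up_weeks, pattern, log_path="reference_class.yaml"):
    """Scale a bottom-up estimate by the median actual_vs_initial
    multiplier of past projects with a matching pattern."""
    with open(log_path) as f:
        log = yaml.safe_load(f)["reference_class"]
    similar = [p["actual_vs_initial"] for p in log
               if pattern.lower() in p["pattern"].lower()]
    # No similar prior project: fall back to the ~2x new-pattern multiplier
    multiplier = median(similar) if similar else 2.0
    return bottom_up_weeks * multiplier

# 3 weeks bottom-up for a new integration -> 6 weeks with the 2x prior
print(forecast_weeks(3, "integration"))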

What does estimation look like at multi-team scale?

The picture changes when you cross 50+ engineers and multiple teams: estimates stop being a team-local tool and start feeding cross-team dependency sequencing.

The single biggest mistake at this scale is treating a quarterly plan as a fixed contract. It is a forecast with a confidence interval, and the interval shrinks over time.

What failure modes does estimation introduce?

The act of estimating has side effects of its own. A single number spoken aloud becomes an anchor. A range shared upward hardens into a commitment. A team that feels scored on accuracy starts padding (see the FAQ below). The practices in this chapter - ranges instead of points, reference classes, team-level calibration - exist to counter exactly these effects.

When is detailed estimation a waste of time?

Two cases.

Time-boxed exploration. "We have two weeks to learn whether this is feasible." No estimate needed; report the result on the deadline.

Truly mandatory work. A security patch must be applied this week. Estimating how long it takes does not change whether you do it. Just do it and report.

The estimation discipline earns its cost when leadership has to commit external resources against the answer. Below that threshold, "rough size + report on Friday" beats elaborate estimation rituals.

Where should you go from here?

Next chapter: risk and issue tracking - the artifact that catches the surprises your estimate did not see. After that, stakeholders and comms covers how to communicate the range honestly without losing trust.

Frequently asked questions

Why are software estimates always wrong?
Because you are estimating work that has never been done. If it had been done, you would copy it. Every estimate is a guess about discovery time, integration surprises, and the team's skill on this specific problem. Treat estimates as ranges with confidence levels, not single numbers. The back-of-envelope chapter from System Design uses the same idea for capacity.
What is reference class forecasting?
Look at the last 3-5 projects of a similar shape and use their actual durations as the prior. 'Last time we built a CRUD admin for a similar domain, it took 6 weeks.' This beats bottom-up estimation by 30-40% in research from Flyvbjerg, because it implicitly captures all the unknown-unknowns that bottom-up misses. Build a small log of past project durations - that is your reference class.
When is T-shirt sizing better than story points?
Early in a project, before tasks are decomposed. T-shirts (S, M, L, XL) communicate magnitude without false precision - 'XL means more than a sprint, we should split it'. Story points are for sprint-level planning, after the work is broken down. Mixing the two ('this is 13 points') invites false confidence.
How do I track estimate vs actual without it becoming a punishment?
Track at the team level, not the individual level. The metric is 'are our estimates within 50% of actual on average', not 'whose estimate was wrong'. Report the calibration trend in the retrospective every quarter; adjust estimation rules based on what you learn. If individuals feel scored, accuracy drops because they pad.