Software Estimation: Three-Point, Reference Class, T-Shirt
How to estimate software work without lying to yourself: three-point estimation, reference class forecasting, T-shirt sizing, and when planning poker is worth the time.
Table of contents
- When does estimation actually have to be precise?
- What is the cost of bad estimation?
- What does the minimal estimation toolkit look like?
- How does three-point estimation work in practice?
- What is the concrete reference-class log?
- What does estimation look like at multi-team scale?
- What failure modes does estimation introduce?
- When is detailed estimation a waste of time?
- Where should you go from here?
The hardest question a tech lead is asked - "how long will this take?" - has no honest answer in single-number form. The honest answer is a range with a confidence level. This chapter teaches four techniques that produce defensible ranges, when to use which, and how to keep the team honest about calibration over time.
When does estimation actually have to be precise?
Three situations. Most others, rough order-of-magnitude is fine.
Hard external commitment. A regulatory deadline, a contract with a client penalty, a marketing launch tied to ad spend. The business has to commit money against the estimate, so the estimate needs a confidence interval, not a guess.
Quarterly planning. The org wants to know what gets done this quarter so it can sequence dependencies across teams. Imprecise estimates here cause cross-team thrash.
Build vs buy decisions. "Build it ourselves" vs "license a SaaS" turns on cost. The build estimate has to be defensible.
For everything else - sprint-level work, exploratory spikes, internal tools - "rough size" is enough. Estimating in more detail than the decision needs is its own expense.
What is the cost of bad estimation?
Three failure modes that show up later:
Death-march sprints. The project committed to a date based on a single-number estimate that was the optimistic case. When reality arrives, the team works weekends to hit the date. Quality drops, attrition rises.
Over-padding. The team learned from the previous death march and pads every estimate by 3x. Now real estimates are 6 weeks but they say 18; leadership stops trusting numbers.
Decision paralysis. "We can't decide build vs buy until we estimate the build." The estimate takes a month. By then the buy option has changed pricing.
What does the minimal estimation toolkit look like?
Four techniques, increasing in cost and accuracy:
```mermaid
flowchart LR
    TS[T-Shirt sizing<br/>S/M/L/XL<br/>5 minutes] --> RC[Reference class<br/>past 3-5 projects<br/>30 minutes]
    RC --> TP[Three-point<br/>best/likely/worst<br/>1 hour per item]
    TP --> PP[Planning poker<br/>full team<br/>2 hours per backlog]
```
Use them in stages. T-shirt at the idea stage; reference class at the kickoff; three-point during planning; planning poker when the backlog is being broken into stories. Most teams skip directly to planning poker and miss the upstream techniques that catch order-of-magnitude errors.
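At the idea stage, T-shirt sizing is just a shared vocabulary for order of magnitude. A minimal sketch; the size-to-weeks mapping here is an illustrative assumption, not a standard, and should be calibrated against your own reference-class log:

```python
# Illustrative T-shirt buckets in weeks. The ranges are assumptions;
# calibrate them against your team's own reference-class log.
TSHIRT_WEEKS = {
    "S": (0.5, 1),   # under a week
    "M": (1, 3),
    "L": (3, 8),
    "XL": (8, 26),   # a quarter or more: break it down before committing
}

def tshirt_range(size: str) -> tuple[float, float]:
    """Return the (low, high) week range for a T-shirt size."""
    return TSHIRT_WEEKS[size.upper()]
```

The point of the lookup is not precision; it is forcing everyone to agree on what "L" means in weeks before the number gets repeated in a planning doc.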
How does three-point estimation work in practice?
For each task, ask three numbers:
- Best case (B) — everything goes right.
- Most likely (M) — your honest guess.
- Worst case (W) — the integration surprises hit.
Then compute the PERT estimate and standard deviation:

```
estimate = (B + 4M + W) / 6
stddev   = (W - B) / 6
range    = estimate ± 2 * stddev   (95% confidence)
```
Example: a task with B=2 days, M=4 days, W=10 days gives estimate = (2 + 16 + 10) / 6 = 4.67 days, stddev = 1.33 days, 95% range = 2-7 days. Report the range, not the point.
Sum the task estimates across the project, and combine the uncertainties by adding variances, not stddevs: project stddev = sqrt(sum of stddev²). That gives you the project-level range. The shock for most teams is how wide it is once the W terms compound, which is the point of showing the math to leadership.
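The arithmetic is small enough to script. A minimal sketch of the per-task formula and the project-level rollup; the helper names are illustrative, not a library API:

```python
import math

def pert(best: float, likely: float, worst: float) -> tuple[float, float]:
    """Return (estimate, stddev) for one task using the PERT formulas."""
    estimate = (best + 4 * likely + worst) / 6
    stddev = (worst - best) / 6
    return estimate, stddev

def project_range(tasks: list[tuple[float, float, float]]) -> tuple[float, float, float]:
    """Sum task estimates; combine uncertainty by adding variances, not stddevs."""
    estimates, stddevs = zip(*(pert(*t) for t in tasks))
    total = sum(estimates)
    sigma = math.sqrt(sum(s * s for s in stddevs))  # variances add
    return total, total - 2 * sigma, total + 2 * sigma  # ~95% range

# The worked example from the text: B=2, M=4, W=10
est, sd = pert(2, 4, 10)  # est ~4.67 days, sd ~1.33 days
```

Feeding a list of (B, M, W) triples through `project_range` is how you get the project-wide range to report instead of a point estimate.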
What is the concrete reference-class log?
Keep a simple YAML file in the team's repo with past project durations, updated at every retrospective:

```yaml
reference_class:
  - project: "Admin dashboard for billing"
    started: "2026-01-15"
    shipped: "2026-03-10"
    elapsed_weeks: 8
    initial_estimate_weeks: 4
    actual_vs_initial: 2.0x
    pattern: "CRUD admin + 2 reports"
    surprises: "auth provider rate limits; QA env flaky for 1 week"
  - project: "Inventory webhook integration"
    started: "2026-02-20"
    shipped: "2026-04-05"
    elapsed_weeks: 6
    initial_estimate_weeks: 3
    actual_vs_initial: 2.0x
    pattern: "third-party API integration"
    surprises: "vendor SDK undocumented edge cases"
  - project: "Search re-indexing pipeline"
    started: "2026-03-01"
    shipped: "2026-04-10"
    elapsed_weeks: 6
    initial_estimate_weeks: 6
    actual_vs_initial: 1.0x
    pattern: "Elasticsearch migration"
    surprises: "none - had a similar prior project"
```
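The log can feed the next estimate directly. A sketch of the idea, with the entries inlined as simplified Python dicts so it runs standalone; in practice you would load the team's YAML file (e.g. with PyYAML) and keep the full field names:

```python
import statistics

# Simplified entries from the log above; load from the YAML file in practice.
LOG = [
    {"pattern": "CRUD admin + 2 reports", "initial": 4, "elapsed": 8},
    {"pattern": "third-party API integration", "initial": 3, "elapsed": 6},
    {"pattern": "Elasticsearch migration", "initial": 6, "elapsed": 6},
]

def reference_multiplier(pattern: str, log=LOG) -> float:
    """Median actual/initial ratio for matching patterns, else the team-wide ratio."""
    ratios = [e["elapsed"] / e["initial"] for e in log
              if pattern.lower() in e["pattern"].lower()]
    if not ratios:  # no matching pattern: fall back to the whole log
        ratios = [e["elapsed"] / e["initial"] for e in log]
    return statistics.median(ratios)

# Adjust a fresh 5-week gut estimate for an unfamiliar integration project
adjusted = 5 * reference_multiplier("API integration")  # 5 * 2.0 = 10.0 weeks
```

The multiplier is the whole trick: your gut estimate times what this pattern actually cost last time.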
Two things to notice. First, actual_vs_initial is consistently ~2x for new patterns and ~1x for repeat patterns. Second, the surprises column captures why, and that becomes the input to next quarter's risk register (chapter 3).
What does estimation look like at multi-team scale?
Three changes when crossing 50+ engineers and multiple teams:
- Capacity, not duration. Big projects are estimated in team-quarters, not person-days. "This needs Team A for one quarter and Team B for two months" is the unit.
- Dependency Gantts. Inter-team dependencies appear; when Team B's work depends on Team A's API, the worst-case slips cascade. Use a Mermaid gantt chart (see the planning chapter) to expose them.
- Confidence windows by phase. Discovery has 50% confidence; build has 80%; launch has 95%. The estimate tightens as the unknowns close. Report the confidence with the date.
The single biggest mistake at this scale is treating a quarterly plan as a fixed contract. It is a forecast with a confidence interval, and the interval shrinks over time.
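The confidence windows can be derived from the same three-point stddev under a normal approximation. A sketch; the phase names follow the text, while the shrinking per-phase stddev values are illustrative assumptions:

```python
from statistics import NormalDist

# Confidence level reported at each phase, per the text above.
PHASE_CONFIDENCE = {"discovery": 0.50, "build": 0.80, "launch": 0.95}

def confidence_window(estimate: float, stddev: float, phase: str) -> tuple[float, float]:
    """Symmetric interval around the estimate at the phase's reported confidence."""
    conf = PHASE_CONFIDENCE[phase]
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # two-sided z-score for this confidence
    return estimate - z * stddev, estimate + z * stddev

# A 12-week project: the stddev shrinks as unknowns close (values illustrative),
# so the window tightens even though the reported confidence rises.
for phase, stddev in [("discovery", 4.0), ("build", 2.0), ("launch", 0.5)]:
    low, high = confidence_window(12, stddev, phase)
```

Reporting "12 weeks, 80% confident it lands between X and Y" is the forecast-not-contract framing in numeric form.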
What failure modes does estimation introduce?
- Anchoring on the first number. The PM says "we hoped for 4 weeks" and the team's three-point estimates all cluster around 4. Mitigation: ask each engineer privately first, reveal together; planning poker codifies this.
- Padding for safety. The team learned that estimates become commitments, so they 3x everything. Mitigation: explicitly track estimate vs actual at the team level; reward calibration, not hitting the date.
- Decomposition theatre. Breaking a 4-week task into 40 one-day tasks looks rigorous but adds no information. The big unknowns are not in the decomposition. Mitigation: spike the riskiest unknown for 1-2 days first, then re-estimate.
- One-size-fits-all method. Using planning poker for every decision wastes time on small things. Use T-shirt for big-picture, three-point for committed projects, planning poker for sprint-level only.
When is detailed estimation a waste of time?
Two cases.
Time-boxed exploration. "We have two weeks to learn whether this is feasible." No estimate needed; report the result on the deadline.
Truly mandatory work. A security patch must be applied this week. Estimating how long it takes does not change whether you do it. Just do it and report.
The estimation discipline earns its cost when leadership has to commit external resources against the answer. Below that threshold, "rough size + report on Friday" beats elaborate estimation rituals.
Where should you go from here?
Next chapter: risk and issue tracking - the artifact that catches the surprises your estimate did not see. After that, stakeholders and comms covers how to communicate the range honestly without losing trust.