
Incident Response: Severity, Comms, Rollback

How to run an incident from page to all-clear: severity levels, incident commander role, comms cadence during the incident, and the rollback decision tree.

Table of contents
  1. When does incident process actually pay back?
  2. What is the cost of weak incident process?
  3. What does the incident response shape look like?
  4. What is the concrete incident runbook template?
  5. How does incident response scale to multi-team?
  6. What failure modes does the process introduce?
  7. When is formal incident process overkill?
  8. Where should you go from here?

The first time you are paged at 3 AM about a production incident, the difference between a 30-minute outage and a 3-hour outage is the runbook. Not the engineer's heroism, not the team's experience - the document that says what to check, who to call, when to roll back. This chapter shows the runbook shape and the incident process that uses it.

When does incident process actually pay back?

Three signals.

Production has external users. Internal-only tools tolerate hour-long outages; customer-facing services do not.

On-call covers more than one person. A single dedicated operator can hold all context in their head. The moment two people share rotation, written runbooks become essential.

SLA / SLO commitments exist. Whether contractual or internal, breach has a measurable cost. The process exists to keep breach rare.

If the system is internal-only, lightly used, and you can fix things at human pace, formal incident process is overhead.

What is the cost of weak incident process?

Three failure modes.

Heroic recovery, no learning. A senior engineer fixes the incident at 4 AM. No post-mortem. The same incident happens 6 weeks later because nothing changed.

Communication blackout. Engineers fix; nobody updates customers. Customers churn during the silence.

Rollback hesitation. "We're so close to fixing it" - 30 minutes becomes 2 hours. Rollback would have taken 5 minutes.

What does the incident response shape look like?

```mermaid
flowchart TB
    Page[Page fires] --> Tri[Triage<br/>0-5 min]
    Tri --> Sev{Severity?}
    Sev -->|P1| P1[Production down<br/>full team paged<br/>30-min comms]
    Sev -->|P2| P2[Degraded<br/>on-call + lead<br/>hourly comms]
    Sev -->|P3| P3[Cosmetic<br/>backlog ticket]
    P1 --> Decide{Fix in 30 min?}
    P2 --> Decide
    Decide -->|Yes| Fix[Fix forward]
    Decide -->|No| Roll[Roll back]
    Fix --> AllClear[Verify metrics<br/>all clear]
    Roll --> AllClear
    AllClear --> Post[Post-mortem<br/>within 48h]
```

P1 means production is down or critically degraded for customers; P2 means degraded but functional; P3 means minor. Severity drives who gets paged, comms cadence, and how aggressively to roll back.
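The severity thresholds from the runbook below can be mechanized so triage is a lookup, not a judgment call. A minimal sketch in Python; the function name and the idea of feeding it dashboard metrics are illustrative, but the thresholds match the runbook's definitions (error rate > 5% or p99 > 5x baseline is P1; 0.5-5% or 2-5x is P2):

```python
def classify_severity(error_rate: float, p99_ms: float, baseline_p99_ms: float) -> str:
    """Map current metrics to a severity level using the runbook thresholds.

    error_rate is a fraction (0.05 == 5%); latency is compared as a
    multiple of the service's normal p99 baseline.
    """
    latency_ratio = p99_ms / baseline_p99_ms
    if error_rate > 0.05 or latency_ratio > 5:
        return "P1"  # page everyone, update customers every 30 min
    if error_rate >= 0.005 or latency_ratio >= 2:
        return "P2"  # page on-call + lead, update hourly
    return "P3"  # ticket, respond within business hours
```

Encoding the thresholds once means the on-call engineer, the alerting rules, and the runbook cannot drift apart.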

What is the concrete incident runbook template?

```markdown
# Incident Runbook — {{ Service Name }}

## Severity definitions
- **P1**: Service down OR error rate > 5% OR latency p99 > 5x baseline.
  Page everyone. Update customers every 30 min.
- **P2**: Degraded - error rate 0.5-5% OR latency p99 2-5x baseline.
  Page on-call + lead. Update hourly.
- **P3**: Cosmetic, single-user, or non-blocking. Ticket; respond within
  4 hours during business hours.

## On-call response (first 5 minutes)
1. Acknowledge the page
2. Open the incident channel: #incident-{{ timestamp }}
3. Pull up dashboards: {{ Grafana link }}
4. Identify severity from definitions above
5. If P1, page incident commander backup; activate comms lead

## Roles during incident
- **Incident commander** (IC): runs the channel, makes decisions
- **Comms lead**: writes the customer + stakeholder updates
- **Engineers**: investigate and fix
- IC and comms lead are NOT the same person

## Decision tree (check in order)

- Recent deploy in last 6 hours? -> Roll back deploy first (5 min)
- Database query slow? -> Check pg_stat_activity; kill long queries
- External dependency failing? -> Check status page; activate fallback
- Cache stampede? -> Pre-warm hot keys; see caching chapter

## Stakeholder comms templates

**P1 customer-facing**: We are aware of an issue affecting
{{ feature }}. Our team is investigating. We will update by
{{ time + 30 min }}.

**P1 internal**: [P1] {{ Service }} - {{ symptom }} - IC:
{{ name }} - Channel: #incident-{{ id }}

## After all-clear checklist

- Customer comms: resolved
- Incident channel: summary posted
- Post-mortem doc created (template link)
- Calendar slot booked for blameless review (within 48h)
```
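The database branch of the decision tree ("check pg_stat_activity; kill long queries") can be sketched concretely. A minimal Python sketch; the 30-second threshold and the helper names are assumptions to tune per service, but `pg_stat_activity` and `pg_terminate_backend()` are standard PostgreSQL:

```python
from datetime import datetime, timedelta, timezone

# Threshold for "long-running" is an assumption -- tune per service.
LONG_QUERY_SECONDS = 30

# Runs against PostgreSQL's pg_stat_activity view to list active queries
# older than the threshold, slowest first.
LONG_QUERY_SQL = """
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '%d seconds'
ORDER BY runtime DESC;
""" % LONG_QUERY_SECONDS

def long_running(rows, now, threshold=timedelta(seconds=LONG_QUERY_SECONDS)):
    """Filter (pid, query_start) rows down to queries older than the threshold.

    rows: iterable of (pid, query_start_datetime) tuples, e.g. from a cursor.
    Returns the pids you would pass to pg_terminate_backend().
    """
    return [pid for pid, started in rows if now - started > threshold]
```

Killing a query is reversible in a way a crashed service is not, which is why this branch sits early in the tree.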

The decision tree is the most useful part. At 3 AM, the on-call engineer should not be inventing the response - they should be following the tree.
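"Following the tree" can be made literal: encode the ordered checks so the first match wins. A minimal sketch in Python; the `incident` dict keys and the 6-hour deploy window are illustrative assumptions mirroring the runbook's tree, not a real schema:

```python
from datetime import datetime, timedelta

def next_action(incident: dict, now: datetime) -> str:
    """Walk the runbook decision tree in order; first matching branch wins.

    The dict keys (last_deploy, slow_queries, ...) are illustrative.
    """
    if now - incident["last_deploy"] < timedelta(hours=6):
        return "roll back the deploy"  # cheapest, most likely cause first
    if incident.get("slow_queries"):
        return "check pg_stat_activity; kill long queries"
    if incident.get("dependency_down"):
        return "check vendor status page; activate fallback"
    if incident.get("cache_stampede"):
        return "pre-warm hot keys"
    return "escalate to incident commander"  # no branch matched
```

The ordering is the point: the checks run from cheapest-to-reverse to most involved, so the engineer never debugs a slow query while a five-minute rollback is available.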

How does incident response scale to multi-team?

```mermaid
flowchart TB
    Page[Page fires] --> OnCall[Service on-call<br/>P1 in 5 min]
    OnCall --> Multi{Touches multiple<br/>services?}
    Multi -->|Yes| WarRoom[Open war room<br/>page IC + comms]
    Multi -->|No| Single[Single team handles]
    WarRoom --> Other1[Other service on-call<br/>joins channel]
    WarRoom --> Other2[Customer support<br/>joins channel]
    WarRoom --> Other3[Comms lead<br/>writes customer update]
```
The war room is the multi-team incident pattern. Single Slack channel; IC has authority across teams; each team's on-call joins. The comms lead writes one update that goes to all audiences (the stakeholder plan lists them).
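The comms cadence the lead must hold (30 minutes for P1, hourly for P2, from the severity definitions) is easy to drop during a war room, so it is worth computing rather than remembering. A minimal sketch in Python; the function name is an assumption:

```python
from datetime import datetime, timedelta
from typing import Optional

# Cadence per severity, from the runbook: P1 every 30 min, P2 hourly.
UPDATE_INTERVAL = {"P1": timedelta(minutes=30), "P2": timedelta(hours=1)}

def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """When the comms lead owes the next stakeholder update; None for P3."""
    interval = UPDATE_INTERVAL.get(severity)
    return last_update + interval if interval else None
```

Wiring this into a channel reminder bot means "still investigating, no progress" goes out on schedule even when every engineer is heads-down.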

What failure modes does the process introduce?

Two traps.

Severity inflation. When P1 pages the whole team, every incident drifts toward P1. Review severity calls in the post-mortem and downgrade without stigma, or the pages stop meaning anything.

Process theater. The IC spends the incident filling in templates instead of deciding. The runbook serves the responder, not the other way around; cut any step that does not change the next action.

When is formal incident process overkill?

Two cases.

Single-engineer service. One person owns the system, gets paged, fixes it, writes a Slack message about what happened. Adding a 5-page runbook is overhead.

Internal beta. A feature behind an internal-only flag breaking is not a customer incident. Triage in business hours.

The full process scales with customer impact and team size. Below the threshold, lighter cadence works.

Where should you go from here?

Next chapter: retrospective and post-mortem

Frequently asked questions

When do I roll back vs fix forward?
Default to roll back. If the incident is the result of a recent change and rollback is < 30 minutes, do it now and analyze later. Fix forward only when rollback is impossible (data already migrated, third-party API committed) or the fix is genuinely faster than rollback. Most teams underestimate rollback speed and overestimate fix speed under pressure.
Who is the incident commander?
Whoever was on-call when the page fired, plus a hand-off if the incident exceeds their expertise. The incident commander is not the most senior engineer by default - they are the person making the calls. Their job is to decide, not to fix. The observability chapter from System Design covers the metrics that drive the decisions.
How often should I update during an incident?
P1 (production down): every 30 minutes, even if the update is 'still investigating, no progress'. Silence is worse than bad news. P2 (degraded): every hour. The customer support team forwards your update to customers; without it, they make things up. The stakeholder chapter covers the comms channels.
What does blameless post-mortem actually mean?
The post-mortem names what happened and why, but not whose fault it was. The framing is 'this could have happened to anyone with our tooling and process; how do we make our tooling and process catch it next time'. Blameless does not mean accountability-free; it means the accountability is on systems, not individuals.