
Launch and Handover: The Go-Live Checklist That Survives

How to launch software safely: go-live checklist, rollback plan, ops handover doc, and the day-zero communication plan that keeps stakeholders informed.

Table of contents
  1. When does formal launch process pay back?
  2. What is the cost of skipping launch artifacts?
  3. What is the minimal go-live checklist?
  4. What does the ops handover doc look like?
  5. How does launch scale to multi-team?
  6. What failure modes does launch process introduce?
  7. When is launch ceremony overkill?
  8. Where should you go from here?

The launch is the moment the project meets the user. Done well, the launch feels boring - because the team rehearsed it. Done poorly, the team learns about every assumption that was wrong at once. This chapter shows the three artifacts (checklist, rollback, handover) that make boring launches reliable.

When does formal launch process pay back?

Three signals.

User-facing change. Anything customers see needs a launch window with rollback. Internal-only changes can be lighter.

Irreversible without effort. Database schema migrations, data backfills, billing changes. Once shipped, undoing costs hours or days. The launch checklist is your insurance.

Cross-team coordination. Marketing has an announcement to send, customer support needs a briefing, on-call needs standby coverage. Without a written launch plan, at least one of the three is forgotten.

For a feature flag rollout to 1% of users with no data changes, the launch is "flip flag, watch dashboard". No formal artifact needed.

What is the cost of skipping launch artifacts?

Three failure modes.

Surprise rollback. Engineering ships, support gets paged about confusion, marketing's email goes out as planned, and nobody knows whether to roll back. Coordination collapses.

The 3 AM page lands on the wrong person. No handover doc means the on-call engineer who picks up the page has no context, and mean time to recovery doubles.

Stakeholder shock. A launch the sponsor expected on Monday ships on Wednesday because of unwritten dependencies. Trust erodes; the team has to apologise without good explanation.

What is the minimal go-live checklist?

# Go-Live Checklist — {{ Feature / Project }}

## Pre-launch (T-2 days)
- [ ] Feature flag created and defaults to off
- [ ] Migration scripts tested in staging twice
- [ ] Rollback plan written and reviewed (see below)
- [ ] On-call schedule for launch window confirmed
- [ ] Customer support runbook updated; CS team briefed
- [ ] Stakeholder comms sent (email + Slack #launches channel)

## Launch window (T-0)
- [ ] Deploy artifact to production
- [ ] Smoke tests pass
- [ ] Flag enabled for internal users (1%)
- [ ] Metrics dashboard green for 30 minutes
- [ ] Flag enabled for 10% of users
- [ ] Metrics green for 60 minutes
- [ ] Flag enabled for 100% of users
- [ ] Announcement sent

## Post-launch (T+24h)
- [ ] No new incidents tied to this launch
- [ ] Customer support ticket volume normal
- [ ] Performance metrics within target
- [ ] Handover doc complete and assigned
- [ ] Project channel archived; ops channel has owner

## Rollback plan
**Trigger**: error rate > 0.5% sustained 5 min, OR latency p99 > 2x baseline sustained 10 min, OR customer complaint volume > 10x baseline.

**Steps** (target: < 30 min):
1. Set feature flag to off (1 min)
2. Verify metrics return to baseline (5 min)
3. If migration involved: run rollback migration (10 min)
4. Notify stakeholders + support (5 min)
5. Open incident channel for root cause analysis

**Owner during launch window**: {{ on-call lead name }}

Two details. The pre-launch section is checked 48 hours ahead so gaps surface early. The rollback trigger has specific thresholds, not "if it feels broken" - this stops debate during the incident.
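The "sustained N minutes" qualifier in the trigger is what prevents a single bad metrics scrape from starting a rollback. A minimal sketch of how that evaluation could work, assuming one metric sample per minute; the class and function names are illustrative, not a real monitoring API:

```python
from collections import deque

class SustainedTrigger:
    """Fires only when every sample in the sustain window breaches the
    threshold, mirroring the rollback triggers above
    (e.g. error rate > 0.5% sustained 5 min)."""

    def __init__(self, threshold, sustain_samples):
        self.threshold = threshold
        # Ring buffer holding the most recent samples.
        self.window = deque(maxlen=sustain_samples)

    def update(self, value):
        self.window.append(value)
        # Fire only when the window is full AND every sample breaches.
        return (len(self.window) == self.window.maxlen
                and all(v > self.threshold for v in self.window))

# One trigger per rollback condition, sampled once a minute.
error_rate = SustainedTrigger(threshold=0.005, sustain_samples=5)   # > 0.5% for 5 min
latency_p99 = SustainedTrigger(threshold=2.0, sustain_samples=10)   # > 2x baseline for 10 min

def should_roll_back(error_sample, latency_ratio_sample):
    # OR semantics: any one sustained breach starts the rollback plan.
    fired_error = error_rate.update(error_sample)
    fired_latency = latency_p99.update(latency_ratio_sample)
    return fired_error or fired_latency
```

The point of encoding the trigger is the same as writing it down: during the incident, nobody argues about whether five minutes of 0.6% errors "counts".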

What does the ops handover doc look like?

# Ops Handover — {{ Feature }}

## Who owns this in production
Team: {{ Team name }}
On-call rotation: {{ link to PagerDuty schedule }}

## Architecture summary
{{ One-paragraph description of the system + Mermaid link }}

## Common issues and fixes
| Symptom | Likely cause | Fix |
|---------|--------------|-----|
| 500s on /refund | Transient Stripe API error | Retry; check Stripe status page |
| Slow page load on /orders | Cache miss spike | Wait 5 min for cache warmup |

## Dashboards
- Service metrics: {{ Grafana link }}
- Business metrics: {{ link }}
- Error logs: {{ link }}

## Runbooks
- Refund failed: {{ link }}
- Migration rollback: {{ link }}

## Escalation
- P1 (down): page on-call; tag #incident
- P2 (degraded): file ticket; respond in 4 hours
- P3 (cosmetic): backlog

## Out of scope (what this team does NOT own)
- Payment provider integration (Platform team)
- Email delivery (Comms team)

This is the doc the on-call engineer opens at 3 AM. Make it short, copy-pastable into Slack, and link-rich.

How does launch scale to multi-team?

sequenceDiagram
    participant PM as PM
    participant Eng as Engineering
    participant Mkt as Marketing
    participant CS as Customer Support
    participant On as On-call
    PM->>Eng: T-7 days: launch readiness review
    Eng->>PM: Confirm checklist green
    PM->>CS: T-3 days: brief support team
    PM->>Mkt: T-2 days: confirm comms scheduled
    PM->>On: T-1 day: confirm coverage
    Eng->>On: T-0: deploy + ramp
    On->>PM: T+24h: handover complete

Each team has a step in the launch sequence with a deadline. The PM coordinates; each team has its own checklist. The stakeholder chapter covers the comms cadence around launch.
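The T-minus deadlines in the diagram are easy to get wrong when the launch date moves. A small sketch that expands the sequence into dated action items, so rescheduling is one function call (the sequence data mirrors the diagram above; the structure is an illustration, not a prescribed tool):

```python
from datetime import date, timedelta

# (days before launch, owner, action) — taken from the sequence above.
# -1 means one day after launch (T+24h).
LAUNCH_SEQUENCE = [
    (7, "PM -> Engineering", "launch readiness review"),
    (3, "PM -> Support", "brief support team"),
    (2, "PM -> Marketing", "confirm comms scheduled"),
    (1, "PM -> On-call", "confirm coverage"),
    (0, "Engineering", "deploy + ramp"),
    (-1, "On-call -> PM", "handover complete"),
]

def schedule(launch_day):
    """Expand the T-minus sequence into (date, owner, action) items."""
    return [(launch_day - timedelta(days=d), owner, action)
            for d, owner, action in LAUNCH_SEQUENCE]
```

If the sponsor slips the launch from Monday to Wednesday, regenerating the schedule and re-sending it is how you avoid the "stakeholder shock" failure mode above.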

What failure modes does launch process introduce?

Three to watch.

Checklist theater. Boxes get ticked without the underlying check actually happening. Spot-check a couple of items before the launch window opens.

Launch queues. When every change needs a window, changes batch up waiting for one, and each launch carries more accumulated risk. Keep the lightweight path (flag flip, config push) genuinely lightweight.

Diffusion of responsibility. A signed-off checklist can make everyone assume someone else is watching the dashboard. This is why the rollback plan names a single owner for the launch window.

When is launch ceremony overkill?

Two cases.

Internal tool used by 5 people. A change to your own team's admin dashboard does not need a 48-hour comms window.

Reversible by environment variable. A config-only change you can revert in 30 seconds is not a launch; it is a config push. Document the revert command and ship.

The launch process earns its overhead at user-facing scope, irreversibility, or cross-team dependency. Below that, lighter process works.

Where should you go from here?

Next chapter: incident and rollback - what happens when the launch checklist is not enough and something breaks. After that, retrospective and post-mortem turns lessons into changes for the next project.

Frequently asked questions

Feature flag or big-bang launch?
Feature flag almost always. The flag lets you flip the launch off in one click if something goes wrong, and lets you ramp to 1% / 10% / 100% of users gradually. Big-bang launches still happen for things you cannot flag (database migrations, billing changes), and there the rollback plan must be solid - covered in the incident chapter.
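The 1% / 10% / 100% ramp in that answer is just a gate table the launch runner walks through. A hedged sketch, where `set_flag_percent` and `metrics_green` stand in for whatever flag service and dashboard you actually use, and the stage soak times come from the checklist above:

```python
import time

# (percentage of users, minutes to soak while watching metrics)
RAMP_STAGES = [(1, 30), (10, 60), (100, 0)]

def ramp(set_flag_percent, metrics_green, clock=time.time, sleep=time.sleep):
    """Walk the 1% -> 10% -> 100% ramp, holding at each stage until the
    soak time passes with green metrics. Returns the stage reached;
    stops at the first red metric so the rollback plan can take over."""
    for percent, soak_minutes in RAMP_STAGES:
        set_flag_percent(percent)
        deadline = clock() + soak_minutes * 60
        while clock() < deadline:
            if not metrics_green():
                return percent  # hold here; do not ramp further
            sleep(60)  # re-check once a minute
    return 100
```

Injecting the clock and sleep functions keeps the ramp testable without waiting real hours, which is also how you rehearse the launch in staging.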
Who owns the system after launch?
Whoever signed the RACI from chapter 4 for ops. The handover doc names them. If no team has signed up to own ops, do not launch - 'we'll figure it out' guarantees a 3 AM page that lands on the engineer who built it. Ownership before launch, not after.
How long should the launch window be?
Plan a 4-hour window for the change itself plus 48 hours of heightened attention. The 4 hours covers deploy, smoke test, ramp, and verification. The 48 hours is when on-call watches metrics extra-carefully. Communicate both windows to stakeholders so they know when to expect what.
What if the launch is bigger than one window?
Phase it. A migration that takes a week becomes 5 daily 4-hour changes, each with its own checklist. The planning chapter covers phased launches. Multi-day launches without phases mean you are flying without a parachute - if something breaks at hour 30 of 60, rollback is much harder.