Zero-Downtime Deployment — Blue-Green, Canary, and Rolling Update
Posted on: 4/18/2026 12:11:02 AM
Table of contents
- 1. Why do we need Zero-Downtime Deployment?
- 2. Rolling Update — The simplest and most popular option
- 3. Blue-Green Deployment — Instant switch, one-second rollback
- 4. Canary Release — Safe deploys with fine-grained traffic control
- 5. A/B Testing Deployment — Deploy by user segment
- 6. Side-by-side comparison
- 7. Database migrations in a zero-downtime world
- 8. Health checks — the foundation under every strategy
- 9. Graceful Shutdown — the step people skip
- 10. Which strategy is right for your team?
- References
Every deploy is nerve-wracking. A 5-minute outage can mean thousands of lost orders, serious revenue damage, and — even worse — lost user trust. In a world where 99.99% uptime (only ~52 minutes of downtime per year) is the minimum expectation, understanding and correctly applying zero-downtime deployment strategies is no longer "nice to have" — it's a hard requirement.
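To sanity-check the 99.99% figure, the yearly downtime budget for an availability target can be computed directly. This is a minimal sketch: the function name is illustrative and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the allowed downtime per year, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_budget_minutes(target):.1f} min/year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why a single multi-minute deploy outage can consume a large share of a four-nines budget.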
This article goes deep on the four most common deployment strategies: Rolling Update, Blue-Green Deployment, Canary Release, and A/B Testing Deployment. Each fits different situations — there's no "one size fits all".
1. Why do we need Zero-Downtime Deployment?
Before diving into each strategy, let's understand why traditional deployment (stop → deploy → start) no longer cuts it:
graph LR
A[🔴 Traditional Deploy] --> B[Stop Server]
B --> C[Copy Files]
C --> D[Run Migrations]
D --> E[Start Server]
E --> F[Health Check]
G[🟢 Zero-Downtime] --> H[Deploy New Version]
H --> I[Health Check New]
I --> J[Switch Traffic]
J --> K[Drain Old Connections]
style A fill:#e94560,stroke:#fff,color:#fff
style G fill:#4CAF50,stroke:#fff,color:#fff
style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style E fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style F fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style H fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style J fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
style K fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
In traditional deployment, the gap between "Stop Server" and a successful health check can span seconds to minutes. Every request during that window fails. Zero-downtime keeps at least one healthy version serving traffic at all times.
Prerequisites
To apply any zero-downtime strategy, the application must satisfy: (1) Backward-compatible database migrations — the new and old versions must both read/write the DB, (2) Health check endpoints — the load balancer needs to know which instances are healthy, (3) Graceful shutdown — finish in-flight requests before stopping, (4) Stateless applications — or sticky sessions if state is unavoidable.
2. Rolling Update — The simplest and most popular option
How it works
Rolling Update replaces old instances with new ones, either one at a time or in batches. Some instances keep serving traffic at every moment; capacity dips only by the size of the batch currently being replaced.
sequenceDiagram
participant LB as Load Balancer
participant P1 as Pod 1 (v1)
participant P2 as Pod 2 (v1)
participant P3 as Pod 3 (v1)
Note over LB,P3: Step 1: Start Rolling Update
LB->>P1: Drain connections
Note over P1: Terminate v1, Start v2
P1-->>LB: Health check OK (v2)
Note over LB,P3: Step 2: Continue with the next pod
LB->>P2: Drain connections
Note over P2: Terminate v1, Start v2
P2-->>LB: Health check OK (v2)
Note over LB,P3: Step 3: Final pod
LB->>P3: Drain connections
Note over P3: Terminate v1, Start v2
P3-->>LB: Health check OK (v2)
Note over LB,P3: ✅ Complete — every pod runs v2
Kubernetes configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add at most 1 new pod at a time
      maxUnavailable: 1  # At most 1 pod unavailable
  template:
    metadata:
      labels:
        app: web-api     # Must match spec.selector
    spec:
      terminationGracePeriodSeconds: 30  # Pod-level setting; includes the preStop sleep
      containers:
        - name: api
          image: myapp:2.0
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
Key tip: maxSurge vs maxUnavailable
maxSurge: 25% + maxUnavailable: 0 is the safest setting — 100% capacity is always preserved. The trade-off is extra resource overhead for surge pods. If the cluster has resource constraints, use maxSurge: 0 + maxUnavailable: 1 to save resources but temporarily reduce capacity.
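The effect of these two settings can be worked out ahead of time. The sketch below computes the minimum number of available pods and the maximum total pods during a rollout. The helper names are illustrative; the rounding follows the documented Kubernetes behavior, where a percentage maxSurge rounds up and a percentage maxUnavailable rounds down.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Convert an absolute count or a percentage string such as "25%" to a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)               # percentages round up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # percentages round down
    return replicas - unavailable, replicas + surge


print(rollout_bounds(4, "25%", 0))  # safest setting from the tip above
print(rollout_bounds(4, 0, 1))      # resource-constrained setting
```

With 4 replicas, the first setting never drops below 4 available pods but needs room for a fifth; the second never exceeds 4 pods but runs on 3 while each pod is replaced.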
✅ Pros
- Simple configuration, default in Kubernetes
- No need to double the infrastructure
- A failing rollout stalls automatically when new pods fail their readiness checks (reverting still takes a kubectl rollout undo)
- Fits most stateless applications
⚠️ Cons
- During the update, two versions run in parallel
- No instant rollback: kubectl rollout undo is itself another rolling update back to the previous version
- Hard to control the % of traffic going to the new version
- Database migrations must be backward-compatible
3. Blue-Green Deployment — Instant switch, one-second rollback
How it works
Maintain two identical production environments: Blue (live) and Green (idle). Deploy the new version to Green, test thoroughly, then switch 100% of the traffic to Green. Blue becomes the backup — rollback is just a traffic switch back.
graph TB
subgraph "Before deploy"
U1[Users] --> LB1[Load Balancer]
LB1 --> B1[🔵 Blue - v1.0<br/>ACTIVE]
G1[🟢 Green - v0.9<br/>IDLE]
end
subgraph "Deploy v2.0 to Green"
U2[Users] --> LB2[Load Balancer]
LB2 --> B2[🔵 Blue - v1.0<br/>ACTIVE]
G2[🟢 Green - v2.0<br/>TESTING]
end
subgraph "Switch traffic"
U3[Users] --> LB3[Load Balancer]
B3[🔵 Blue - v1.0<br/>STANDBY]
LB3 --> G3[🟢 Green - v2.0<br/>ACTIVE]
end
style B1 fill:#2196F3,stroke:#fff,color:#fff
style G1 fill:#f8f9fa,stroke:#e0e0e0,color:#888
style B2 fill:#2196F3,stroke:#fff,color:#fff
style G2 fill:#4CAF50,stroke:#fff,color:#fff
style B3 fill:#f8f9fa,stroke:#e0e0e0,color:#888
style G3 fill:#4CAF50,stroke:#fff,color:#fff
style LB1 fill:#2c3e50,stroke:#fff,color:#fff
style LB2 fill:#2c3e50,stroke:#fff,color:#fff
style LB3 fill:#2c3e50,stroke:#fff,color:#fff
style U1 fill:#e94560,stroke:#fff,color:#fff
style U2 fill:#e94560,stroke:#fff,color:#fff
style U3 fill:#e94560,stroke:#fff,color:#fff
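The switch shown in the diagram boils down to moving a single pointer. Here is a minimal in-memory sketch of that idea; the BlueGreenRouter class and its method names are hypothetical, and a real setup would change a load balancer target or DNS record instead of a Python attribute.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which environment is live and which version each one runs."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1.0", "green": "v0.9"}
    )
    active: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str, is_healthy: Callable[[str], bool]) -> bool:
        """Install `version` on the idle environment; switch only if it passes checks."""
        target = self.idle
        self.versions[target] = version
        if not is_healthy(target):
            return False       # live traffic never touched the new version
        self.active = target   # the switch is a single pointer change
        return True

    def rollback(self) -> None:
        """Point traffic back at the other environment, which still runs the old version."""
        self.active = self.idle


router = BlueGreenRouter()
router.deploy("v2.0", is_healthy=lambda env: True)
print(router.active, router.versions[router.active])
router.rollback()
print(router.active, router.versions[router.active])
```

Because the old environment is left untouched until the next deploy, rollback costs the same as the original switch.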
Implementing with AWS ECS
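{
  "deploymentController": {
    "type": "CODE_DEPLOY"
  }
}
With the CODE_DEPLOY deployment controller, AWS CodeDeploy drives the Blue-Green switch for the ECS service. The ECS deployment circuit breaker applies only to the rolling-update (ECS) controller, so automatic rollback is configured on the CodeDeploy deployment group instead. The deployment itself is described in appspec.yaml: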
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:region:account:task-definition/myapp:2"
        LoadBalancerInfo:
          ContainerName: "web-api"
          ContainerPort: 8080
Hooks:
  - BeforeAllowTraffic: "LambdaFunctionForValidation"
  - AfterAllowTraffic: "LambdaFunctionForSmokeTest"
Implementing with Nginx (self-managed)
upstream blue {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}
upstream green {
    server 10.0.2.10:8080;
    server 10.0.2.11:8080;
}
# Use map to switch traffic
map $cookie_deployment $backend {
    default blue;    # ← change to "green" when switching
    "green" green;
}
server {
    listen 80;
    location / {
        proxy_pass http://$backend;
        proxy_set_header X-Deployment-Target $backend;
    }
}
Double the infrastructure cost
Blue-Green requires two identical production environments. On the cloud you can optimize by spinning Green up only when deploying, then tearing it down after confirming stability (usually 1-24 hours). AWS ECS and Azure Container Apps support this model natively.
✅ Pros
- Instant rollback (just switch traffic back)
- Production-like testing before the switch
- No mixed versions in production
- Conceptually simple
⚠️ Cons
- Double infrastructure cost
- Database migrations are tricky (both envs share the DB)
- No gradual rollout — 100% of traffic switches at once
- Session persistence must be handled carefully
4. Canary Release — Safe deploys with fine-grained traffic control
How it works
The term "canary" comes from miners carrying canaries to detect toxic gases early. Similarly, a Canary Release sends a small fraction of traffic (1-5%) to the new version. If metrics look healthy → ramp it up. If something breaks → roll back immediately, affecting only a few users.
graph LR
U[Users<br/>100%] --> LB[Load Balancer<br/>Traffic Split]
LB -->|95%| V1[Version 1.0<br/>Stable]
LB -->|5%| V2[Version 2.0<br/>Canary]
V2 --> M[Metrics<br/>Monitoring]
M -->|OK| INC[Ramp to 25%<br/>→ 50% → 100%]
M -->|Error| RB[Rollback<br/>0% canary]
style U fill:#e94560,stroke:#fff,color:#fff
style LB fill:#2c3e50,stroke:#fff,color:#fff
style V1 fill:#2196F3,stroke:#fff,color:#fff
style V2 fill:#4CAF50,stroke:#fff,color:#fff
style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style INC fill:#4CAF50,stroke:#fff,color:#fff
style RB fill:#e94560,stroke:#fff,color:#fff
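To make the traffic split concrete, here is a small sketch of weight-based routing. It is an illustration only: the function name is hypothetical, and hashing a request key is one possible approach (an assumption here) that keeps a given user on the same version, whereas a mesh or load balancer may simply pick randomly per request.

```python
import hashlib


def pick_version(request_key: str, canary_weight: int) -> str:
    """Route a request to "canary" or "stable" based on a stable hash of its key."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_weight else "stable"


keys = [f"user-{i}" for i in range(10_000)]
share = sum(pick_version(k, 5) == "canary" for k in keys) / len(keys)
print(f"canary share at weight 5: {share:.1%}")
```

Raising canary_weight only ever moves more buckets to the canary, so users already on the new version stay there as the rollout ramps up.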
Canary configuration on Kubernetes with Istio
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
    - web-api.example.com
  http:
    - route:
        - destination:
            host: web-api
            subset: stable
          weight: 95
        - destination:
            host: web-api
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-api
spec:
  host: web-api
  subsets:
    - name: stable
      labels:
        version: v1
    - name: canary
      labels:
        version: v2
Automated Canary with Flagger
Flagger is a progressive delivery tool for Kubernetes that automates the entire canary process based on metrics:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  service:
    port: 8080
  analysis:
    # Bump by 10% each interval if metrics pass
    interval: 1m
    threshold: 5     # Failed checks before rollback
    maxWeight: 50    # Max 50% traffic on canary
    stepWeight: 10   # 10% increments
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99    # Require 99%+ success rate
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500   # P99 latency < 500ms
        interval: 1m
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://flagger-loadtester/
        metadata:
          type: bash
          # -f makes curl exit non-zero on HTTP errors, so a failing endpoint fails the test
          cmd: "curl -sf http://web-api-canary:8080/health"
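To build intuition for how interval, stepWeight, maxWeight, and threshold interact, the ramp can be modelled as a simple loop. This is a simplified model written for illustration, not Flagger's actual implementation: it assumes failures are counted cumulatively and that the canary is promoted as soon as it reaches the maximum weight.

```python
from typing import Iterable, List, Tuple


def run_canary(checks: Iterable[bool], step_weight: int = 10,
               max_weight: int = 50, threshold: int = 5) -> Tuple[str, List[int]]:
    """Simulate a stepped canary ramp.

    Each item in `checks` is the result of one analysis interval (True = metrics passed).
    Returns the outcome and the canary weight recorded after each interval.
    """
    weight, failures, history = 0, 0, []
    for passed in checks:
        if passed:
            weight = min(weight + step_weight, max_weight)
        else:
            failures += 1
            if failures >= threshold:
                history.append(0)  # all traffic goes back to stable
                return "rolled_back", history
        history.append(weight)
        if weight >= max_weight:
            return "promoted", history
    return "in_progress", history


print(run_canary([True, True, False, True, True, True]))
```

With the values from the config above, a clean run needs five passing intervals, so the fastest possible promotion takes about five minutes.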
Canary metrics to watch
The four most important signals for canary analysis (closely mirroring the four golden signals of latency, traffic, errors, and saturation from Google's SRE book) are: Error Rate (the canary's 5xx/4xx rate compared with stable), Latency P99 (must match or beat stable), Throughput (the canary handles equivalent traffic per instance), and Saturation (no abnormal CPU/memory spikes). If any of them breaches its threshold → automatic rollback.
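A comparison like this can be expressed as a small gate function. The sketch below is illustrative: the Metrics fields and the tolerances (1 percentage point on errors, 10% on latency and throughput, 25% on CPU) are assumptions chosen for the example, not recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    rps_per_instance: float
    cpu_utilization: float   # 0.0 to 1.0


def canary_violations(stable: Metrics, canary: Metrics) -> List[str]:
    """Return the checks the canary fails relative to stable (empty list = healthy)."""
    violations = []
    if canary.error_rate > stable.error_rate + 0.01:
        violations.append("error_rate")
    if canary.p99_latency_ms > stable.p99_latency_ms * 1.10:
        violations.append("latency_p99")
    if canary.rps_per_instance < stable.rps_per_instance * 0.90:
        violations.append("throughput")
    if canary.cpu_utilization > stable.cpu_utilization * 1.25:
        violations.append("saturation")
    return violations


stable = Metrics(error_rate=0.002, p99_latency_ms=200, rps_per_instance=100, cpu_utilization=0.5)
canary = Metrics(error_rate=0.05, p99_latency_ms=400, rps_per_instance=50, cpu_utilization=0.9)
print(canary_violations(stable, canary))
```

Comparing against the stable version's live numbers, rather than fixed absolute limits, keeps the gate meaningful when overall traffic shifts during the rollout.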
✅ Pros
- Low risk — only a small % of users affected
- Catch production bugs early under real traffic
- Auto-rollback driven by metrics
- Fits large-scale systems
⚠️ Cons
- Requires a service mesh or advanced load balancer
- Setting up the monitoring pipeline is complex
- Deploys are slower (you wait at each step)
- Two versions live concurrently — DB migrations must be compatible
5. A/B Testing Deployment — Deploy by user segment
How it works
Unlike Canary (random traffic split), A/B Testing Deployment routes based on user attributes: location, device, user ID, subscription tier, and so on. The goal is both safe deployment and measuring the impact of the new feature.
# Istio VirtualService with header-based routing
# Assumes a DestinationRule for web-api that defines subsets named v1 and v2
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
    - web-api.example.com
  http:
    # Route the internal team to the new version
    - match:
        - headers:
            x-user-tier:
              exact: "internal"
      route:
        - destination:
            host: web-api
            subset: v2
    # Route premium users
    - match:
        - headers:
            x-user-tier:
              exact: "premium"
      route:
        - destination:
            host: web-api
            subset: v2
    # Everyone else → stable
    - route:
        - destination:
            host: web-api
            subset: v1
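The routing decision encoded in those rules is easy to state in code, which also makes it easy to unit-test before touching the mesh. The function below is a hypothetical mirror of the config, not part of Istio; it assumes header names are matched case-insensitively and values exactly.

```python
from typing import Mapping

# Tiers routed to the new version, mirroring the two match rules in the config
V2_TIERS = {"internal", "premium"}


def choose_subset(headers: Mapping[str, str]) -> str:
    """Return the subset a request would be routed to, based on x-user-tier."""
    normalized = {name.lower(): value for name, value in headers.items()}
    # Exact, case-sensitive value match; anything else falls through to stable
    return "v2" if normalized.get("x-user-tier") in V2_TIERS else "v1"


for tier in ("internal", "premium", "free"):
    print(tier, "->", choose_subset({"x-user-tier": tier}))
```

Requests without the header fall through to v1, which is the safe default: a missing or malformed segment label never exposes a user to the experiment.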
6. Side-by-side comparison
| Criterion | Rolling Update | Blue-Green | Canary Release | A/B Testing |
|---|---|---|---|---|
| Setup complexity | ⭐ Low | ⭐⭐ Medium | ⭐⭐⭐ High | ⭐⭐⭐ High |
| Rollback speed | Slow (minutes) | Instant (seconds) | Fast (seconds) | Fast (seconds) |
| Infrastructure cost | Lowest | Double | +10-50% | +10-50% |
| Blast radius on failure | ~25-50% users | 100% or 0% | 1-10% users | Specific segment |
| Mixed versions | Yes (temporarily) | No | Yes | Yes |
| Needs Service Mesh | No | No | Recommended | Required |
| Best fit for | Most applications | Critical systems, infrequent deploys | Large-scale, frequent deploys | Feature experiments |
7. Database migrations in a zero-downtime world
This is the hardest part. When two versions run concurrently (Rolling, Canary), both v1 and v2 must be compatible with the same database schema. The golden rule: the Expand-and-Contract pattern.
graph TB
subgraph "Phase 1: Expand"
E1[Add new column<br/>nullable/default] --> E2[Deploy v2 code]
E2 --> E3[v1 and v2 both work<br/>v2 starts writing the new column]
end
subgraph "Phase 2: Migrate"
M1[Backfill data<br/>for the new column] --> M2[v2 reads/writes the new column]
M2 --> M3[Retire v1 completely]
end
subgraph "Phase 3: Contract"
C1[Drop the old column] --> C2[Remove legacy code]
C2 --> C3[Schema clean]
end
E3 --> M1
M3 --> C1
style E1 fill:#2196F3,stroke:#fff,color:#fff
style E2 fill:#2196F3,stroke:#fff,color:#fff
style E3 fill:#2196F3,stroke:#fff,color:#fff
style M1 fill:#4CAF50,stroke:#fff,color:#fff
style M2 fill:#4CAF50,stroke:#fff,color:#fff
style M3 fill:#4CAF50,stroke:#fff,color:#fff
style C1 fill:#e94560,stroke:#fff,color:#fff
style C2 fill:#e94560,stroke:#fff,color:#fff
style C3 fill:#e94560,stroke:#fff,color:#fff
Example: renaming a column
Want to rename user_name to display_name? You can't just ALTER TABLE RENAME COLUMN — v1 will crash immediately. Do it this way instead:
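-- Phase 1: Expand (deploy before the code change)
-- PostgreSQL syntax; MySQL needs separate BEFORE INSERT and BEFORE UPDATE triggers
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);
-- Trigger to keep the two columns in sync during the transition
CREATE FUNCTION sync_display_name() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    -- Whichever column the writer filled in is copied to the other
    NEW.display_name := COALESCE(NEW.display_name, NEW.user_name);
    NEW.user_name := COALESCE(NEW.user_name, NEW.display_name);
  ELSIF NEW.user_name IS DISTINCT FROM OLD.user_name THEN
    NEW.display_name := NEW.user_name;  -- v1 updated the old column
  ELSIF NEW.display_name IS DISTINCT FROM OLD.display_name THEN
    NEW.user_name := NEW.display_name;  -- v2 updated the new column
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER sync_display_name
BEFORE INSERT OR UPDATE ON users
FOR EACH ROW EXECUTE FUNCTION sync_display_name();
-- Phase 2: Backfill existing data
UPDATE users SET display_name = user_name WHERE display_name IS NULL;
-- Phase 3: After v1 is fully retired (drop the trigger before the column it references)
DROP TRIGGER sync_display_name ON users;
DROP FUNCTION sync_display_name();
ALTER TABLE users DROP COLUMN user_name;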
Never in a single deploy
Each phase is its own deploy. Phase 1 ships first and stabilizes for 1-2 days. Phase 2 finishes the backfill. Only then does Phase 3 ship the code that drops the old column. Many teams cram all three phases into a single migration script, which is a common root cause of migration-related outages.
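On a large table, the Phase 2 backfill is usually run in small batches rather than as one long UPDATE, so that locks stay short while both versions keep serving traffic. The sketch below shows the idea with SQLite from the Python standard library purely so it can run anywhere; the function name and batch size are illustrative, and a production database would need its own batching syntax and throttling.

```python
import sqlite3


def backfill_display_name(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy user_name into display_name in small batches; return rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users SET display_name = user_name
            WHERE id IN (
                SELECT id FROM users
                WHERE display_name IS NULL AND user_name IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # short transactions keep locks brief
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.executemany("INSERT INTO users (user_name) VALUES (?)",
                 [(f"user{i}",) for i in range(2500)])
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")  # Phase 1: expand
print(backfill_display_name(conn))                               # Phase 2: backfill
```

The loop is safe to re-run: rows that already have a display_name are skipped, so an interrupted backfill can simply be started again.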
8. Health checks — the foundation under every strategy
Without good health checks, every deploy strategy is meaningless. Health checks must distinguish three states:
// .NET Minimal API health check example
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
var builder = WebApplication.CreateBuilder(args);
// Register checks before Build(). AddSqlServer/AddRedis come from the
// AspNetCore.HealthChecks.SqlServer and AspNetCore.HealthChecks.Redis packages;
// WarmupCheck is your own IHealthCheck implementation
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
    .AddSqlServer(connectionString, tags: new[] { "ready" })
    .AddRedis(redisConnection, tags: new[] { "ready" })
    .AddCheck<WarmupCheck>("warmup", tags: new[] { "startup" });
var app = builder.Build();
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: is the app running at all?
    Predicate = check => check.Tags.Contains("live")
});
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: is the app ready to receive traffic?
    Predicate = check => check.Tags.Contains("ready")
});
app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
    // Startup: for slow-starting containers
    Predicate = check => check.Tags.Contains("startup")
});
app.Run();
# Kubernetes probe configuration
containers:
  - name: api
    livenessProbe:
      httpGet:
        path: /health/live
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 15
      failureThreshold: 3    # 3 failures → restart container
    readinessProbe:
      httpGet:
        path: /health/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 2    # 2 failures → remove from Service
    startupProbe:
      httpGet:
        path: /health/startup
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 5
      failureThreshold: 30   # Allow 150s for warm-up
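The three probe types answer different questions, and the distinction is language-independent. As a rough sketch using only the Python standard library, the endpoints could look like this; AppState and its fields are hypothetical, and how database_ok gets updated (for example by a background check) is left out.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


@dataclass
class AppState:
    warmed_up: bool = False    # set once caches and connections are primed
    database_ok: bool = False  # assumed to be updated by a background dependency check


STATE = AppState()


def probe_status(path: str, state: AppState) -> Tuple[int, str]:
    """Map a probe path to an HTTP status code and body."""
    if path == "/health/live":
        return 200, "alive"  # the process is up; no dependency checks here
    if path == "/health/startup":
        return (200, "started") if state.warmed_up else (503, "warming up")
    if path == "/health/ready":
        ready = state.warmed_up and state.database_ok
        return (200, "ready") if ready else (503, "not ready")
    return 404, "unknown probe"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe_status(self.path, STATE)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))


def serve(port: int = 8080) -> None:
    """Start the probe server; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Keeping dependency checks out of the liveness endpoint is deliberate: a database outage should take pods out of rotation through readiness, not trigger a wave of container restarts.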
9. Graceful Shutdown — the step people skip
When Kubernetes sends SIGTERM to a pod, the application needs to:
sequenceDiagram
participant K8s as Kubernetes
participant Pod as Application Pod
participant LB as Service/Endpoint
K8s->>Pod: SIGTERM
K8s->>LB: Remove Pod from Endpoints
Note over Pod: Start graceful shutdown
Pod->>Pod: Stop accepting NEW requests
Pod->>Pod: Finish in-flight requests (max 30s)
Pod->>Pod: Close DB connections
Pod->>Pod: Flush logs/metrics
Pod-->>K8s: Exit 0
Note over K8s: If it doesn't exit within<br/>terminationGracePeriodSeconds<br/>→ SIGKILL
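// .NET graceful shutdown
using Serilog;
var builder = WebApplication.CreateBuilder(args);
// Give in-flight requests up to 15s to finish after SIGTERM
// (must fit inside terminationGracePeriodSeconds minus the preStop sleep)
builder.Services.Configure<HostOptions>(o => o.ShutdownTimeout = TimeSpan.FromSeconds(15));
var app = builder.Build();
var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();
lifetime.ApplicationStopping.Register(() =>
{
    // Fires on SIGTERM; the host then stops the server and drains in-flight requests
    // The preStop hook's sleep 15 ensures the endpoint has been removed
    Log.Information("Shutting down gracefully...");
});
lifetime.ApplicationStopped.Register(() =>
{
    Log.CloseAndFlush();
});
app.Run();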
preStop hook: why the sleep?
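Kubernetes starts terminating the pod and removes it from the Service endpoints in parallel, not sequentially. Endpoint removal takes time to propagate to kube-proxy and load balancers, so for a few seconds the load balancer can still send traffic to a pod that is shutting down. When a preStop hook is defined, SIGTERM is sent only after the hook finishes, so preStop: sleep 15 gives endpoint propagation enough time to complete before the app actually starts shutting down. The hook's time counts against terminationGracePeriodSeconds: with a 15-second sleep and a 30-second grace period, the app has roughly 15 seconds left to drain.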
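The drain sequence itself (stop accepting new work, let in-flight requests finish, give up after a deadline) can be sketched independently of any web framework. The Drainer class below is a hypothetical illustration in Python, not a drop-in for the .NET host; a real server would call try_begin_request and end_request around each request.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests and refuses new ones once shutdown begins."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self.shutting_down = False

    def try_begin_request(self) -> bool:
        with self._lock:
            if self.shutting_down:
                return False  # caller should answer 503 or close the connection
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._idle:
            self._in_flight -= 1
            self._idle.notify_all()

    def begin_shutdown(self, *_signal_args) -> None:
        # Plain attribute write, no lock: safe to call from a signal handler
        self.shutting_down = True

    def wait_until_drained(self, timeout: float) -> bool:
        """Block until no requests are in flight; False if the timeout expires first."""
        with self._idle:
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout=timeout)


drainer = Drainer()
# Signal handlers can only be installed from the main thread
if threading.current_thread() is threading.main_thread():
    signal.signal(signal.SIGTERM, drainer.begin_shutdown)
```

The timeout passed to wait_until_drained should be shorter than whatever remains of the grace period after the preStop sleep, so the process can close connections and flush logs before SIGKILL arrives.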
10. Which strategy is right for your team?
Mixing in practice
Most production systems don't use a single strategy. A common pattern: Rolling Update for small config changes + Canary for major feature releases + Blue-Green for infrastructure upgrades (Kubernetes version, database engine). Match the strategy to the risk level of each change, rather than applying one formula to everything.
References
- Kubernetes Documentation — Rolling Update Deployment
- Flagger — Progressive Delivery with Istio
- AWS CodeDeploy — Blue/Green Deployments on ECS
- Azure Container Apps — Blue-Green Deployment
- Google Cloud — Canary Deployment Strategy
- Martin Fowler — Blue-Green Deployment
- DORA — Accelerate State of DevOps Report