Zero-Downtime Deployment — Blue-Green, Canary, and Rolling Update

Posted on: 4/18/2026 12:11:02 AM

Every deploy is nerve-wracking. A 5-minute outage can mean thousands of lost orders, serious revenue damage, and — even worse — lost user trust. In a world where 99.99% uptime (only ~52 minutes of downtime per year) is the minimum expectation, understanding and correctly applying zero-downtime deployment strategies is no longer "nice to have" — it's a hard requirement.

  • $5,600: average cost per minute of downtime (Gartner)
  • 52 min: maximum annual downtime under a 99.99% SLA
  • 46×: more frequent deploys among elite DevOps teams (DORA)
  • 7,200+: deploys per day at Amazon (average)
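
The 52-minute figure follows directly from the SLA percentage. Here is a minimal sketch of that arithmetic, assuming a 365-day year; the function name downtime_budget_minutes is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(sla_percent: float) -> float:
    """Return the yearly downtime allowed by an availability SLA, in minutes."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% -> {downtime_budget_minutes(sla):.2f} min/year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why a deploy process that costs even a minute per release quickly becomes incompatible with a 99.99% target.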

This article goes deep on the four most common deployment strategies: Rolling Update, Blue-Green Deployment, Canary Release, and A/B Testing Deployment. Each fits different situations — there's no "one size fits all".

1. Why do we need Zero-Downtime Deployment?

Before diving into each strategy, let's understand why traditional deployment (stop → deploy → start) no longer cuts it:

graph LR
    A[🔴 Traditional Deploy] --> B[Stop Server]
    B --> C[Copy Files]
    C --> D[Run Migrations]
    D --> E[Start Server]
    E --> F[Health Check]

    G[🟢 Zero-Downtime] --> H[Deploy New Version]
    H --> I[Health Check New]
    I --> J[Switch Traffic]
    J --> K[Drain Old Connections]

    style A fill:#e94560,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style F fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style H fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style J fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style K fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
  
Traditional deploy vs zero-downtime

In traditional deployment, the gap between "Stop Server" and a successful health check can span seconds to minutes. Every request during that window fails. Zero-downtime keeps at least one healthy version serving traffic at all times.

Prerequisites

To apply any zero-downtime strategy, the application must satisfy four conditions:

  1. Backward-compatible database migrations: the old and new versions must both be able to read and write the same schema.
  2. Health check endpoints: the load balancer needs to know which instances are healthy.
  3. Graceful shutdown: finish in-flight requests before stopping.
  4. Stateless applications, or sticky sessions if state is unavoidable.

2. Rolling Update

How it works

Rolling Update replaces old instances with new ones, either one at a time or in batches. At every step most instances keep serving traffic; only the batch currently being replaced is out of rotation.

sequenceDiagram
    participant LB as Load Balancer
    participant P1 as Pod 1 (v1)
    participant P2 as Pod 2 (v1)
    participant P3 as Pod 3 (v1)

    Note over LB,P3: Step 1: Start Rolling Update
    LB->>P1: Drain connections
    Note over P1: Terminate v1, Start v2
    P1-->>LB: Health check OK (v2)

    Note over LB,P3: Step 2: Continue with the next pod
    LB->>P2: Drain connections
    Note over P2: Terminate v1, Start v2
    P2-->>LB: Health check OK (v2)

    Note over LB,P3: Step 3: Final pod
    LB->>P3: Drain connections
    Note over P3: Terminate v1, Start v2
    P3-->>LB: Health check OK (v2)

    Note over LB,P3: ✅ Complete — every pod runs v2
  
Sequential Rolling Update across 3 pods

Kubernetes configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
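spec:
  replicas: 4
  selector:              # apps/v1 requires a selector matching the pod labels
    matchLabels:
      app: web-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add at most 1 new pod at a time
      maxUnavailable: 1   # At most 1 pod unavailable
  template:
    metadata:
      labels:
        app: web-api     # Must match spec.selector.matchLabels
    spec: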
      containers:
      - name: api
        image: myapp:2.0
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 30

Key tip: maxSurge vs maxUnavailable

maxSurge: 25% + maxUnavailable: 0 is the safest setting — 100% capacity is always preserved. The trade-off is extra resource overhead for surge pods. If the cluster has resource constraints, use maxSurge: 0 + maxUnavailable: 1 to save resources but temporarily reduce capacity.
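
To see what a given pair of settings allows, the capacity bounds can be computed directly. The sketch below assumes Kubernetes' documented rounding (maxSurge percentages round up, maxUnavailable percentages round down); resolve and rollout_bounds are illustrative names, not part of any Kubernetes client.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count (int) or a percentage string like '25%' into pods."""
    if isinstance(value, int):
        return value
    scaled = int(value.rstrip("%")) * replicas  # integer math avoids float error
    return -(-scaled // 100) if round_up else scaled // 100


def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # With both at zero the rollout could never make progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - unavailable, replicas + surge


print(rollout_bounds(4, "25%", 0))  # safest: capacity never drops
print(rollout_bounds(4, 0, 1))      # cheapest: no extra pods, one fewer serving
```

The first element is the capacity floor to plan for, and the second is the resource ceiling the cluster must be able to schedule.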

✅ Pros

  • Simple configuration, default in Kubernetes
  • No need to double the infrastructure
  • A failing readiness probe stalls the rollout before bad pods receive traffic (rolling back still requires kubectl rollout undo)
  • Fits most stateless applications

⚠️ Cons

  • During the update, two versions run in parallel
  • No instant rollback: reverting is itself another rolling update back to the previous version
  • Hard to control the % of traffic going to the new version
  • Database migrations must be backward-compatible

3. Blue-Green Deployment — Instant switch, one-second rollback

How it works

Maintain two identical production environments: Blue (live) and Green (idle). Deploy the new version to Green, test thoroughly, then switch 100% of the traffic to Green. Blue becomes the backup — rollback is just a traffic switch back.

graph TB
    subgraph "Before deploy"
        U1[Users] --> LB1[Load Balancer]
        LB1 --> B1[🔵 Blue - v1.0<br/>ACTIVE]
        G1[🟢 Green - v0.9<br/>IDLE]
    end
    subgraph "Deploy v2.0 to Green"
        U2[Users] --> LB2[Load Balancer]
        LB2 --> B2[🔵 Blue - v1.0<br/>ACTIVE]
        G2[🟢 Green - v2.0<br/>TESTING]
    end
    subgraph "Switch traffic"
        U3[Users] --> LB3[Load Balancer]
        B3[🔵 Blue - v1.0<br/>STANDBY]
        LB3 --> G3[🟢 Green - v2.0<br/>ACTIVE]
    end

    style B1 fill:#2196F3,stroke:#fff,color:#fff
    style G1 fill:#f8f9fa,stroke:#e0e0e0,color:#888
    style B2 fill:#2196F3,stroke:#fff,color:#fff
    style G2 fill:#4CAF50,stroke:#fff,color:#fff
    style B3 fill:#f8f9fa,stroke:#e0e0e0,color:#888
    style G3 fill:#4CAF50,stroke:#fff,color:#fff
    style LB1 fill:#2c3e50,stroke:#fff,color:#fff
    style LB2 fill:#2c3e50,stroke:#fff,color:#fff
    style LB3 fill:#2c3e50,stroke:#fff,color:#fff
    style U1 fill:#e94560,stroke:#fff,color:#fff
    style U2 fill:#e94560,stroke:#fff,color:#fff
    style U3 fill:#e94560,stroke:#fff,color:#fff

Blue-Green Deployment through three phases
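
The three phases reduce to a single pointer that names the live environment. The following is a toy model of that idea, not tied to any load balancer API; the class BlueGreenRouter and its methods are hypothetical.

```python
class BlueGreenRouter:
    """Tracks which of two environments receives 100% of the traffic."""

    def __init__(self, blue_version: str, green_version: str) -> None:
        self.versions = {"blue": blue_version, "green": green_version}
        self.active = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        """Install a new version on the idle environment; live traffic is untouched."""
        self.versions[self.idle] = version

    def switch(self, idle_is_healthy: bool) -> bool:
        """Flip all traffic to the idle environment, but only if it passed its checks."""
        if not idle_is_healthy:
            return False
        self.active = self.idle
        return True

    def rollback(self) -> None:
        """Rollback is the same flip in reverse; the old version is still running."""
        self.active = self.idle

    def serving(self) -> str:
        return self.versions[self.active]


router = BlueGreenRouter(blue_version="v1.0", green_version="v0.9")
router.deploy_to_idle("v2.0")            # phase 2: Green is updated while Blue serves
router.switch(idle_is_healthy=True)      # phase 3: all traffic moves at once
print(router.active, router.serving())
```

Because the previous environment keeps running untouched, rollback costs one pointer flip rather than a redeploy, which is where the "instant" in instant rollback comes from.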

Implementing with AWS ECS

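{
  "deploymentController": {
    "type": "CODE_DEPLOY"
  }
}

Setting the deployment controller to CODE_DEPLOY hands the Blue-Green traffic shift to AWS CodeDeploy. The ECS deployment circuit breaker (deploymentCircuitBreaker) only applies to the rolling-update ECS controller, so with CodeDeploy automatic rollback is configured on the CodeDeploy deployment group instead. The deployment itself is described in appspec.yaml: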

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:region:account-id:task-definition/myapp:2"
        LoadBalancerInfo:
          ContainerName: "web-api"
          ContainerPort: 8080
Hooks:
  - BeforeAllowTraffic: "LambdaFunctionForValidation"
  - AfterAllowTraffic: "LambdaFunctionForSmokeTest"

Implementing with Nginx (self-managed)

upstream blue {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}

upstream green {
    server 10.0.2.10:8080;
    server 10.0.2.11:8080;
}

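# Use map to choose the backend. Requests carrying the cookie deployment=green
# go to Green (useful for previewing it); everything else gets the default.
# To cut over or roll back, change the default and reload Nginx.
map $cookie_deployment $backend {
    default blue;       # ← change to "green" when switching
    "green" green;
}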

server {
    listen 80;
    location / {
        proxy_pass http://$backend;
        proxy_set_header X-Deployment-Target $backend;
    }
}

Double the infrastructure cost

Blue-Green requires two identical production environments. On the cloud you can optimize by spinning Green up only when deploying, then tearing it down after confirming stability (usually 1-24 hours). AWS ECS and Azure Container Apps support this model natively.

✅ Pros

  • Instant rollback (just switch traffic back)
  • Production-like testing before the switch
  • No mixed versions in production
  • Conceptually simple

⚠️ Cons

  • Double infrastructure cost
  • Database migrations are tricky (both envs share the DB)
  • No gradual rollout — 100% of traffic switches at once
  • Session persistence must be handled carefully

4. Canary Release — Safe deploys with fine-grained traffic control

How it works

The term "canary" comes from miners carrying canaries to detect toxic gases early. Similarly, a Canary Release sends a small fraction of traffic (1-5%) to the new version. If metrics look healthy → ramp it up. If something breaks → roll back immediately, affecting only a few users.

graph LR
    U[Users<br/>100%] --> LB[Load Balancer<br/>Traffic Split]
    LB -->|95%| V1[Version 1.0<br/>Stable]
    LB -->|5%| V2[Version 2.0<br/>Canary]
    V2 --> M[Metrics<br/>Monitoring]
    M -->|OK| INC[Ramp to 25%<br/>→ 50% → 100%]
    M -->|Error| RB[Rollback<br/>0% canary]

    style U fill:#e94560,stroke:#fff,color:#fff
    style LB fill:#2c3e50,stroke:#fff,color:#fff
    style V1 fill:#2196F3,stroke:#fff,color:#fff
    style V2 fill:#4CAF50,stroke:#fff,color:#fff
    style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style INC fill:#4CAF50,stroke:#fff,color:#fff
    style RB fill:#e94560,stroke:#fff,color:#fff

Canary Release flow with traffic splitting and monitoring
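
One way to picture the traffic split is a stable hash of the user ID, so a given user always lands on the same side. This is an illustrative sticky variant and an assumption of this sketch: weight-based routing in a service mesh typically splits per request instead. The names bucket and route are hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Send buckets below canary_percent to the canary, everyone else to stable."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
on_canary = sum(route(u, 5) == "canary" for u in users)
print(f"{on_canary / len(users):.1%} of users on the canary")
```

Ramping from 5% to 25% only adds buckets, so users who already saw the canary stay on it instead of bouncing between versions.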

Canary configuration on Kubernetes with Istio

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
  - web-api.example.com
  http:
  - route:
    - destination:
        host: web-api
        subset: stable
      weight: 95
    - destination:
        host: web-api
        subset: canary
      weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-api
spec:
  host: web-api
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2

Automated Canary with Flagger

Flagger is a progressive delivery tool for Kubernetes that automates the entire canary process based on metrics:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  service:
    port: 8080
  analysis:
    # Bump by 10% each interval if metrics pass
    interval: 1m
    threshold: 5          # Failures before rollback
    maxWeight: 50         # Max 50% traffic on canary
    stepWeight: 10        # 10% increments
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99            # Require 99%+ success rate
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500           # P99 latency < 500ms
      interval: 1m
    webhooks:
    - name: smoke-test
      type: pre-rollout
      url: http://flagger-loadtester/
      metadata:
        cmd: "curl -s http://web-api-canary:8080/health"
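
The interplay of stepWeight, maxWeight, and threshold is easier to follow as a loop. The sketch below is a simplified model loosely based on those three settings, not Flagger's actual implementation; run_canary and its outcome strings are hypothetical.

```python
def run_canary(checks, step_weight=10, max_weight=50, threshold=5):
    """Replay per-interval check results (True means the metrics passed).

    Returns (outcome, weight_history). The outcome is "promoted",
    "rolled_back", or "in_progress" if the checks run out first.
    """
    weight, failures, history = 0, 0, []
    for passed in checks:
        if not passed:
            failures += 1
            if failures >= threshold:
                history.append(0)  # all traffic goes back to stable
                return "rolled_back", history
            history.append(weight)  # hold the current weight and retry
            continue
        weight = min(weight + step_weight, max_weight)
        history.append(weight)
        if weight == max_weight:
            return "promoted", history
    return "in_progress", history


print(run_canary([True, True, False, True, True, True]))
print(run_canary([True] + [False] * 5))
```

In this model a failed check pauses the ramp rather than aborting it, and only an accumulated failure count triggers the rollback. That matches the intent of the "Failures before rollback" comment in the YAML above.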

Canary metrics to watch

The four most important metrics for canary analysis map onto the four golden signals from Google's SRE book:

  • Error rate: the share of 5xx responses (and unexpected 4xx) compared with stable
  • Latency (P99): must match or beat stable
  • Throughput: the canary handles equivalent traffic per instance
  • Saturation: no abnormal CPU or memory spikes

If any of them breaches its threshold, roll back automatically.
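
A comparison of the canary against stable on these four metrics can be sketched as a small gate function. The tolerance values, field names, and the Snapshot type below are assumptions chosen for illustration; real thresholds depend on the service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    rps_per_instance: float
    cpu_utilization: float   # 0.0 to 1.0


def canary_violations(stable: Snapshot, canary: Snapshot,
                      tolerance: float = 0.10,
                      error_slack: float = 0.001) -> list[str]:
    """Return the metrics on which the canary is meaningfully worse than stable."""
    checks = {
        # Absolute slack, so a stable error rate of zero is not an impossible bar.
        "error_rate": canary.error_rate <= stable.error_rate + error_slack,
        "latency_p99": canary.p99_latency_ms <= stable.p99_latency_ms * (1 + tolerance),
        "throughput": canary.rps_per_instance >= stable.rps_per_instance * (1 - tolerance),
        "saturation": canary.cpu_utilization <= stable.cpu_utilization * (1 + tolerance),
    }
    return [name for name, ok in checks.items() if not ok]


stable = Snapshot(error_rate=0.002, p99_latency_ms=180, rps_per_instance=120, cpu_utilization=0.55)
canary = Snapshot(error_rate=0.020, p99_latency_ms=400, rps_per_instance=118, cpu_utilization=0.57)
print(canary_violations(stable, canary))
```

An empty list means the canary may proceed to the next weight; any entry is a reason to roll back.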

✅ Pros

  • Low risk — only a small % of users affected
  • Catch production bugs early under real traffic
  • Auto-rollback driven by metrics
  • Fits large-scale systems

⚠️ Cons

  • Requires a service mesh or advanced load balancer
  • Setting up the monitoring pipeline is complex
  • Deploys are slower (you wait at each step)
  • Two versions live concurrently — DB migrations must be compatible

5. A/B Testing Deployment — Deploy by user segment

How it works

Unlike Canary (random traffic split), A/B Testing Deployment routes based on user attributes: location, device, user ID, subscription tier, and so on. The goal is both safe deployment and measuring the impact of the new feature.

# Istio VirtualService with header-based routing
# (subsets v1 and v2 must be defined in a matching DestinationRule)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
  - web-api.example.com
  http:
  # Route the internal team to the new version
  - match:
    - headers:
        x-user-tier:
          exact: "internal"
    route:
    - destination:
        host: web-api
        subset: v2
  # Route premium users
  - match:
    - headers:
        x-user-tier:
          exact: "premium"
    route:
    - destination:
        host: web-api
        subset: v2
  # Everyone else → stable
  - route:
    - destination:
        host: web-api
        subset: v1
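
The rules above are evaluated top to bottom and the first match wins, which is why the catch-all route comes last. A minimal sketch of that evaluation order follows; RULES and pick_subset are hypothetical names that mirror the YAML, not an Istio API.

```python
RULES = [
    ({"x-user-tier": "internal"}, "v2"),
    ({"x-user-tier": "premium"}, "v2"),
    ({}, "v1"),  # an empty match is the catch-all and must come last
]


def pick_subset(headers: dict[str, str], rules=RULES) -> str:
    """Return the subset of the first rule whose headers all match exactly."""
    # Header names are case-insensitive; values are compared exactly.
    normalized = {name.lower(): value for name, value in headers.items()}
    for match, subset in rules:
        if all(normalized.get(name) == value for name, value in match.items()):
            return subset
    raise LookupError("no rule matched; add a catch-all route")


print(pick_subset({"X-User-Tier": "premium"}))
print(pick_subset({"X-User-Tier": "free"}))
```

Moving the catch-all to the top would send every request to v1, since no later rule would ever be reached.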

6. Side-by-side comparison

| Criterion | Rolling Update | Blue-Green | Canary Release | A/B Testing |
|---|---|---|---|---|
| Setup complexity | ⭐ Low | ⭐⭐ Medium | ⭐⭐⭐ High | ⭐⭐⭐ High |
| Rollback speed | Slow (minutes) | Instant (seconds) | Fast (seconds) | Fast (seconds) |
| Infrastructure cost | Lowest | Double | +10-50% | +10-50% |
| Blast radius on failure | ~25-50% of users | 100% or 0% | 1-10% of users | Specific segment |
| Mixed versions | Yes (temporarily) | No | Yes | Yes |
| Needs service mesh | No | No | Recommended | Required |
| Best fit for | Most applications | Critical systems, infrequent deploys | Large-scale, frequent deploys | Feature experiments |

7. Database migrations in a zero-downtime world

This is the hardest part. When two versions run concurrently (Rolling, Canary), both v1 and v2 must be compatible with the same database schema. The golden rule: the Expand-and-Contract pattern.

graph TB
    subgraph "Phase 1: Expand"
        E1[Deploy v2 code] --> E2[Add new column<br/>nullable/default]
        E2 --> E3[v1 and v2 both work<br/>v2 starts writing the new column]
    end
    subgraph "Phase 2: Migrate"
        M1[Backfill data<br/>for the new column] --> M2[v2 reads/writes the new column]
        M2 --> M3[Retire v1 completely]
    end
    subgraph "Phase 3: Contract"
        C1[Drop the old column] --> C2[Remove legacy code]
        C2 --> C3[Schema clean]
    end
    E3 --> M1
    M3 --> C1

    style E1 fill:#2196F3,stroke:#fff,color:#fff
    style E2 fill:#2196F3,stroke:#fff,color:#fff
    style E3 fill:#2196F3,stroke:#fff,color:#fff
    style M1 fill:#4CAF50,stroke:#fff,color:#fff
    style M2 fill:#4CAF50,stroke:#fff,color:#fff
    style M3 fill:#4CAF50,stroke:#fff,color:#fff
    style C1 fill:#e94560,stroke:#fff,color:#fff
    style C2 fill:#e94560,stroke:#fff,color:#fff
    style C3 fill:#e94560,stroke:#fff,color:#fff

Expand-and-Contract pattern for safe database migrations
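
The Expand and Migrate phases can be tried locally with an in-memory SQLite database. This is a sketch under stated assumptions: the users table, the v1/v2 helper functions, and the batch size are invented for illustration, and v2 dual-writes in application code rather than relying on a trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT)")
conn.executemany("INSERT INTO users (user_name) VALUES (?)", [("ada",), ("linus",)])


def v1_write(name: str) -> None:
    """Old code: only knows about user_name."""
    conn.execute("INSERT INTO users (user_name) VALUES (?)", (name,))


def v2_write(name: str) -> None:
    """New code: dual-writes so that v1 readers keep working."""
    conn.execute(
        "INSERT INTO users (user_name, display_name) VALUES (?, ?)", (name, name)
    )


def v1_read() -> list[str]:
    return [r[0] for r in conn.execute("SELECT user_name FROM users ORDER BY id")]


def v2_read() -> list[str]:
    """Fall back to the old column for rows that are not backfilled yet."""
    sql = "SELECT COALESCE(display_name, user_name) FROM users ORDER BY id"
    return [r[0] for r in conn.execute(sql)]


def backfill(batch_size: int = 1) -> int:
    """Copy user_name into display_name in small batches; return rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = user_name WHERE id IN ("
            "SELECT id FROM users "
            "WHERE display_name IS NULL AND user_name IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep locks brief
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


# Phase 1: Expand. A new nullable column does not break v1.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
v1_write("grace")      # v1 still works and leaves display_name NULL
v2_write("margaret")   # v2 fills both columns

# Phase 2: Migrate. Backfill everything written before or by v1.
updated = backfill()
print(updated, v2_read())

# Phase 3 (Contract) would drop user_name, but only after v1 is fully retired.
```

As long as v1 is still writing, new rows with an empty display_name keep appearing, so the backfill has to be repeated (or a sync trigger used) until v1 is gone.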

Example: renaming a column

Want to rename user_name to display_name? You can't just ALTER TABLE RENAME COLUMN — v1 will crash immediately. Do it this way instead:

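-- Phase 1: Expand (apply before deploying v2). MySQL syntax.
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Triggers keep the two columns in sync during the transition.
-- MySQL needs one trigger per event; DELIMITER is a mysql client directive.
DELIMITER //
CREATE TRIGGER sync_display_name_insert
BEFORE INSERT ON users
FOR EACH ROW
BEGIN
    IF NEW.display_name IS NULL THEN
        SET NEW.display_name = NEW.user_name;
    ELSEIF NEW.user_name IS NULL THEN
        SET NEW.user_name = NEW.display_name;
    END IF;
END//

CREATE TRIGGER sync_display_name_update
BEFORE UPDATE ON users
FOR EACH ROW
BEGIN
    -- Copy whichever column the writer changed (v1 writes user_name, v2 display_name)
    IF NOT (NEW.user_name <=> OLD.user_name) THEN
        SET NEW.display_name = NEW.user_name;
    ELSEIF NOT (NEW.display_name <=> OLD.display_name) THEN
        SET NEW.user_name = NEW.display_name;
    END IF;
END//
DELIMITER ;

-- Phase 2: Backfill existing data
UPDATE users SET display_name = user_name WHERE display_name IS NULL;

-- Phase 3: After v1 is fully retired (drop the triggers first; they reference user_name)
DROP TRIGGER sync_display_name_insert;
DROP TRIGGER sync_display_name_update;
ALTER TABLE users DROP COLUMN user_name;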

Never in a single deploy

Each phase is its own deploy. Phase 1 ships first and is left to stabilize for 1-2 days. Phase 2 completes the backfill. Only after v1 is fully retired does Phase 3 drop the old column. Cramming all three phases into a single migration script is a common cause of migration-related outages.

8. Health checks — the foundation under every strategy

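Without reliable health checks, none of these strategies works, because the platform cannot tell a healthy instance from a broken one. Health checks must distinguish three states, liveness (is the process running?), readiness (can it serve traffic right now?), and startup (has a slow-starting container finished warming up?):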

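// .NET Minimal API health check example
// AddSqlServer/AddRedis come from the AspNetCore.HealthChecks.SqlServer and
// AspNetCore.HealthChecks.Redis NuGet packages; WarmupCheck is your own IHealthCheck.
// connectionString and redisConnection are assumed to come from configuration.
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Register checks (must happen before builder.Build())
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
    .AddSqlServer(connectionString, tags: new[] { "ready" })
    .AddRedis(redisConnection, tags: new[] { "ready" })
    .AddCheck<WarmupCheck>("warmup", tags: new[] { "startup" });

var app = builder.Build();

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: is the app running at all?
    Predicate = check => check.Tags.Contains("live")
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: is the app ready to receive traffic?
    Predicate = check => check.Tags.Contains("ready")
});

app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
    // Startup: for slow-starting containers
    Predicate = check => check.Tags.Contains("startup")
});

app.Run();

# Kubernetes probe configuration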
containers:
- name: api
  livenessProbe:
    httpGet:
      path: /health/live
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 15
    failureThreshold: 3      # 3 failures → restart container
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 2      # 2 failures → remove from Service
  startupProbe:
    httpGet:
      path: /health/startup
      port: 8080
    initialDelaySeconds: 0
    periodSeconds: 5
    failureThreshold: 30     # Allow 150s for warm-up
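
The comments in the probe configuration follow from one multiplication: periodSeconds times failureThreshold. Here is a small sketch of that arithmetic, ignoring per-probe timeouts and scheduling jitter; the Probe class is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    failure_threshold: int
    initial_delay_seconds: int = 0

    def failure_window(self) -> int:
        """Approximate seconds of consecutive failures before Kubernetes acts."""
        return self.period_seconds * self.failure_threshold


liveness = Probe(period_seconds=15, failure_threshold=3, initial_delay_seconds=10)
readiness = Probe(period_seconds=10, failure_threshold=2, initial_delay_seconds=5)
startup = Probe(period_seconds=5, failure_threshold=30)

print("restart after ~", liveness.failure_window(), "s of failed liveness checks")
print("removed from Service after ~", readiness.failure_window(), "s")
print("startup budget ~", startup.failure_window(), "s")
```

While a startup probe is defined, Kubernetes holds off liveness and readiness checks until it succeeds, so the 150-second budget protects a slow warm-up from being killed by the liveness probe.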

9. Graceful Shutdown — the step people skip

When Kubernetes sends SIGTERM to a pod, the application needs to:

sequenceDiagram
    participant K8s as Kubernetes
    participant Pod as Application Pod
    participant LB as Service/Endpoint

    K8s->>Pod: SIGTERM
    K8s->>LB: Remove Pod from Endpoints
    Note over Pod: Start graceful shutdown
    Pod->>Pod: Stop accepting NEW requests
    Pod->>Pod: Finish in-flight requests (max 30s)
    Pod->>Pod: Close DB connections
    Pod->>Pod: Flush logs/metrics
    Pod-->>K8s: Exit 0

    Note over K8s: If it doesn't exit within<br/>terminationGracePeriodSeconds<br/>→ SIGKILL

Graceful shutdown sequence on Kubernetes

// .NET graceful shutdown
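using Serilog;  // Log below is Serilog's static logger

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();

lifetime.ApplicationStopping.Register(() =>
{
    // Fires when SIGTERM arrives. The host then stops accepting new requests
    // and waits for in-flight ones, up to HostOptions.ShutdownTimeout.
    // The preStop hook's sleep 15 ensures the endpoint has been removed first.
    Log.Information("Shutting down gracefully...");
});

lifetime.ApplicationStopped.Register(() =>
{
    // Runs after the host has stopped: flush buffered log events before exit.
    Log.CloseAndFlush();
});

app.Run();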

preStop hook: why the sleep?

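Kubernetes sends SIGTERM and removes the pod from the Service endpoints in parallel, not in sequence. For a few seconds after SIGTERM, the load balancer can therefore still send traffic to a pod that is shutting down. A preStop sleep of 15 seconds delays SIGTERM so that endpoint removal can propagate before the app starts shutting down. The sleep counts against terminationGracePeriodSeconds, so with a 30-second grace period the app has about 15 seconds left to drain.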
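
The drain step itself (stop accepting new work, then wait for in-flight requests with a deadline) can be sketched independently of any web framework. The Drainer class below is a hypothetical helper; in a real service a SIGTERM handler would call shutdown() and the request middleware would call try_begin() and end().

```python
import threading


class Drainer:
    """Tracks in-flight requests so that shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True

    def try_begin(self) -> bool:
        """Call at the start of a request; returns False once shutdown has begun."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Call when a request finishes, whether it succeeded or failed."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, timeout: float) -> bool:
        """Stop accepting work, then wait up to timeout seconds for in-flight requests.

        Returns True if everything drained, False if the timeout expired first.
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


drainer = Drainer()
drainer.try_begin()                          # one request is in flight
threading.Timer(0.05, drainer.end).start()   # it finishes 50 ms later
print("drained:", drainer.shutdown(timeout=2.0))
```

The timeout passed to shutdown() should fit inside whatever remains of the grace period after the preStop sleep, otherwise the pod is killed before the drain completes.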

10. Which strategy is right for your team?

Startup / small team (1-5 devs)
→ Rolling Update. Simple, effective, default on Kubernetes. Focus on writing good health checks and graceful shutdown. Don't over-engineer before you have real traffic.
Scale-up / critical systems
→ Blue-Green. When downtime is unacceptable (fintech, healthcare, large e-commerce). Infrastructure cost goes up, but instant rollback is worth the price. AWS ECS and Azure Container Apps support it natively.
Enterprise / high traffic (>10K RPM)
→ Canary Release. When a small blast radius is the top priority. Invest in a service mesh (Istio/Linkerd) and an observability stack. Flagger + Prometheus automate the entire pipeline.
Product-led / data-driven teams
→ A/B + Canary combined. When you need to measure the business impact of every deploy. Feature flags (OpenFeature, LaunchDarkly) combined with Canary give maximum control.

Mixing in practice

Most production systems don't use a single strategy. A common pattern: Rolling Update for small config changes + Canary for major feature releases + Blue-Green for infrastructure upgrades (Kubernetes version, database engine). Match the strategy to the risk level of each change, rather than applying one formula to everything.
