Posted on: 4/18/2026 12:11:02 AM

# Zero-Downtime Deployment — Blue-Green, Canary, and Rolling Update

Every deploy is nerve-wracking. A 5-minute outage can mean thousands of lost orders, serious revenue damage, and — even worse — lost user trust. In a world where 99.99% uptime (only ~52 minutes of downtime per year) is the minimum expectation, understanding and correctly applying **zero-downtime deployment** strategies is no longer "nice to have" — it's a hard requirement.

- **$5,600**: average cost per minute of downtime (Gartner)
- **52 min**: maximum annual downtime under a 99.99% SLA
- **46×**: more frequent deploys among Elite DevOps teams (DORA)
- **7,200+**: deploys per day at Amazon (average)
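
As a quick check on the 52-minute figure, the downtime budget for an availability target can be computed directly. This is a minimal sketch; the function name is illustrative and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
```

Each extra "nine" cuts the budget by a factor of ten, which is why a single stop-and-restart deploy can use up a large share of a 99.99% budget.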

This article goes deep on the four most common deployment strategies: **Rolling Update**, **Blue-Green Deployment**, **Canary Release**, and **A/B Testing Deployment**. Each fits different situations — there's no "one size fits all".

## 1. Why do we need Zero-Downtime Deployment?

Before diving into each strategy, let's understand why traditional deployment (stop → deploy → start) no longer cuts it:

```mermaid
graph LR
    A[🔴 Traditional Deploy] --> B[Stop Server]
    B --> C[Copy Files]
    C --> D[Run Migrations]
    D --> E[Start Server]
    E --> F[Health Check]

    G[🟢 Zero-Downtime] --> H[Deploy New Version]
    H --> I[Health Check New]
    I --> J[Switch Traffic]
    J --> K[Drain Old Connections]

    style A fill:#e94560,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style F fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style H fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style I fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style J fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style K fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
```

Traditional deploy vs zero-downtime

In traditional deployment, the gap between "Stop Server" and a successful health check can span seconds to minutes. Every request during that window fails. Zero-downtime keeps at least one healthy version serving traffic at all times.

#### Prerequisites

To apply any zero-downtime strategy, the application must satisfy: (1) **Backward-compatible database migrations** — the new and old versions must both read/write the DB, (2) **Health check endpoints** — the load balancer needs to know which instances are healthy, (3) **Graceful shutdown** — finish in-flight requests before stopping, (4) **Stateless applications** — or sticky sessions if state is unavoidable.

## 2. Rolling Update — The simplest and most popular option

### How it works

Rolling Update replaces old instances with new ones, either one at a time or in batches. Throughout the rollout, the number of instances serving traffic stays at or above the desired count minus the batch currently being replaced, so capacity may dip slightly but never reaches zero.

```mermaid
sequenceDiagram
    participant LB as Load Balancer
    participant P1 as Pod 1 (v1)
    participant P2 as Pod 2 (v1)
    participant P3 as Pod 3 (v1)

    Note over LB,P3: Step 1: Start Rolling Update
    LB->>P1: Drain connections
    Note over P1: Terminate v1, Start v2
    P1-->>LB: Health check OK (v2)

    Note over LB,P3: Step 2: Continue with the next pod
    LB->>P2: Drain connections
    Note over P2: Terminate v1, Start v2
    P2-->>LB: Health check OK (v2)

    Note over LB,P3: Step 3: Final pod
    LB->>P3: Drain connections
    Note over P3: Terminate v1, Start v2
    P3-->>LB: Health check OK (v2)

    Note over LB,P3: ✅ Complete, every pod runs v2
```

Sequential Rolling Update across 3 pods

### Kubernetes configuration

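```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: web-api         # Required in apps/v1; must match the template labels
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # At most 1 pod above the desired count
      maxUnavailable: 1    # At most 1 pod unavailable
  template:
    metadata:
      labels:
        app: web-api       # Label name/value chosen for this example
    spec:
      containers:
      - name: api
        image: myapp:2.0
        readinessProbe:    # New pods receive traffic only after this passes
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 15"]
      terminationGracePeriodSeconds: 30   # Must be longer than the preStop sleep
```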

#### Key tip: maxSurge vs maxUnavailable

`maxSurge: 25%` + `maxUnavailable: 0` is the safest setting — 100% capacity is always preserved. The trade-off is extra resource overhead for surge pods. If the cluster has resource constraints, use `maxSurge: 0` + `maxUnavailable: 1` to save resources but temporarily reduce capacity.
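
To see what a given pair of settings means in pods, the bounds can be worked out by hand or with a few lines of code. The sketch below assumes the documented Kubernetes rounding rules (percentages round up for `maxSurge` and down for `maxUnavailable`); the function names are illustrative.

```python
import math


def resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas, max_surge, max_unavailable):
    """Return (minimum available pods, maximum total pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return replicas - unavailable, replicas + surge


if __name__ == "__main__":
    print(rollout_bounds(4, 1, 1))      # the manifest above
    print(rollout_bounds(4, "25%", 0))  # full capacity preserved
    print(rollout_bounds(4, 0, 1))      # no extra pods, reduced capacity
```

For the manifest above (4 replicas, both values set to 1), the rollout keeps at least 3 pods available and never runs more than 5 in total.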

#### ✅ Pros

- Simple configuration, default in Kubernetes
- No need to double the infrastructure
- A failing readiness probe stalls the rollout before more healthy pods are replaced (rolling back is still a manual `kubectl rollout undo`)
- Fits most stateless applications

#### ⚠️ Cons

- During the update, two versions run in parallel
- No instant rollback: undoing a release is itself another rolling update that takes minutes
- Hard to control the % of traffic going to the new version
- Database migrations must be backward-compatible

## 3. Blue-Green Deployment — Instant switch, one-second rollback

### How it works

Maintain two identical production environments: **Blue** (live) and **Green** (idle). Deploy the new version to Green, test thoroughly, then switch 100% of the traffic to Green. Blue becomes the backup — rollback is just a traffic switch back.

```mermaid
graph TB
    subgraph "Before deploy"
        U1[Users] --> LB1[Load Balancer]
        LB1 --> B1[🔵 Blue - v1.0<br/>ACTIVE]
        G1[🟢 Green - v0.9<br/>IDLE]
    end

    subgraph "Deploy v2.0 to Green"
        U2[Users] --> LB2[Load Balancer]
        LB2 --> B2[🔵 Blue - v1.0<br/>ACTIVE]
        G2[🟢 Green - v2.0<br/>TESTING]
    end

    subgraph "Switch traffic"
        U3[Users] --> LB3[Load Balancer]
        B3[🔵 Blue - v1.0<br/>STANDBY]
        LB3 --> G3[🟢 Green - v2.0<br/>ACTIVE]
    end

    style B1 fill:#2196F3,stroke:#fff,color:#fff
    style G1 fill:#f8f9fa,stroke:#e0e0e0,color:#888
    style B2 fill:#2196F3,stroke:#fff,color:#fff
    style G2 fill:#4CAF50,stroke:#fff,color:#fff
    style B3 fill:#f8f9fa,stroke:#e0e0e0,color:#888
    style G3 fill:#4CAF50,stroke:#fff,color:#fff
    style LB1 fill:#2c3e50,stroke:#fff,color:#fff
    style LB2 fill:#2c3e50,stroke:#fff,color:#fff
    style LB3 fill:#2c3e50,stroke:#fff,color:#fff
    style U1 fill:#e94560,stroke:#fff,color:#fff
    style U2 fill:#e94560,stroke:#fff,color:#fff
    style U3 fill:#e94560,stroke:#fff,color:#fff
```

Blue-Green Deployment through three phases
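
The three phases boil down to a small state machine: validate the idle environment, swap the pointer, and keep the old environment around for rollback. The sketch below models only that logic; the class and the health-check callback are hypothetical and stand in for whatever load balancer or DNS switch is actually used.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenRouter:
    """Tracks which environment receives 100% of traffic."""
    active: str = "blue"
    standby: str = "green"

    def switch(self, is_healthy: Callable[[str], bool]) -> bool:
        """Send all traffic to the standby environment if it passes validation."""
        if not is_healthy(self.standby):
            return False  # keep serving from the current environment
        self.active, self.standby = self.standby, self.active
        return True

    def rollback(self) -> None:
        """Point traffic back at the previous environment, which was kept warm."""
        self.active, self.standby = self.standby, self.active


if __name__ == "__main__":
    router = BlueGreenRouter()
    print(router.switch(lambda env: True), router.active)
    router.rollback()
    print(router.active)
```

Rollback is fast only as long as the old environment is still running, which is the reason for the doubled infrastructure cost discussed in this section.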

### Implementing with AWS ECS

```json
{
  "deploymentController": {
    "type": "CODE_DEPLOY"
  }
}
```

AWS CodeDeploy with ECS natively supports Blue-Green. With the `CODE_DEPLOY` controller, automatic rollback is configured on the CodeDeploy deployment group; the ECS `deploymentCircuitBreaker` setting applies only to the rolling-update (`ECS`) controller. Configure the deployment via `appspec.yaml`:

```yaml
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:region:account:task-definition/myapp:2"
        LoadBalancerInfo:
          ContainerName: "web-api"
          ContainerPort: 8080
Hooks:
  - BeforeAllowTraffic: "LambdaFunctionForValidation"
  - AfterAllowTraffic: "LambdaFunctionForSmokeTest"
```

### Implementing with Nginx (self-managed)

```nginx
upstream blue {
    server 10.0.1.10:8080;
    server 10.0.1.11:8080;
}

upstream green {
    server 10.0.2.10:8080;
    server 10.0.2.11:8080;
}

# Use map to switch traffic
map $cookie_deployment $backend {
    default blue;       # ← change to "green" when switching
    "green" green;
}

server {
    listen 80;
    location / {
        proxy_pass http://$backend;
        proxy_set_header X-Deployment-Target $backend;
    }
}
```

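#### Double the infrastructure cost

Blue-Green requires two identical production environments. On the cloud you can optimize by spinning Green up only when deploying, then tearing it down after confirming stability (usually 1-24 hours). AWS ECS and Azure Container Apps support this model natively.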

#### ✅ Pros

- Instant rollback (just switch traffic back)
- Production-like testing before the switch
- No mixed versions in production
- Conceptually simple

#### ⚠️ Cons

- Double infrastructure cost
- Database migrations are tricky (both envs share the DB)
- No gradual rollout — 100% of traffic switches at once
- Session persistence must be handled carefully

## 4. Canary Release — Safe deploys with fine-grained traffic control

### How it works

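The term "canary" comes from miners carrying canaries to detect toxic gases early. Similarly, a Canary Release sends a small fraction of traffic (1-5%) to the new version. If metrics look healthy, ramp it up. If something breaks, roll back immediately, affecting only a few users.

```mermaid
graph LR
    U[Users<br/>100%] --> LB[Load Balancer<br/>Traffic Split]
    LB -->|95%| V1[Version 1.0<br/>Stable]
    LB -->|5%| V2[Version 2.0<br/>Canary]
    V2 --> M[Metrics<br/>Monitoring]
    M -->|OK| INC[Ramp to 25%<br/>→ 50% → 100%]
    M -->|Error| RB[Rollback<br/>0% canary]

    style U fill:#e94560,stroke:#fff,color:#fff
    style LB fill:#2c3e50,stroke:#fff,color:#fff
    style V1 fill:#2196F3,stroke:#fff,color:#fff
    style V2 fill:#4CAF50,stroke:#fff,color:#fff
    style M fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style INC fill:#4CAF50,stroke:#fff,color:#fff
    style RB fill:#e94560,stroke:#fff,color:#fff
```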

Canary Release flow with traffic splitting and monitoring

### Canary configuration on Kubernetes with Istio

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
  - web-api.example.com
  http:
  - route:
    - destination:
        host: web-api
        subset: stable
      weight: 95
    - destination:
        host: web-api
        subset: canary
      weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: web-api
spec:
  host: web-api
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
```

### Automated Canary with Flagger

**Flagger** is a progressive delivery tool for Kubernetes that automates the entire canary process based on metrics:

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-api
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  service:
    port: 8080
  analysis:
    # Bump by 10% each interval if metrics pass
    interval: 1m
    threshold: 5          # Failures before rollback
    maxWeight: 50         # Max 50% traffic on canary
    stepWeight: 10        # 10% increments
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99            # Require 99%+ success rate
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500           # P99 latency < 500ms
      interval: 1m
    webhooks:
    - name: smoke-test
      type: pre-rollout
      url: http://flagger-loadtester/
      timeout: 15s
      metadata:
        type: bash         # Run the command and gate the rollout on its exit code
        cmd: "curl -sf http://web-api-canary:8080/health"
```
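
To make the interplay of `stepWeight`, `maxWeight`, and `threshold` concrete, here is a simplified model of the analysis loop. It is a sketch of the idea only, not Flagger's actual controller logic; `check` is a hypothetical callback standing in for the metric queries.

```python
from typing import Callable, List, Tuple


def run_canary(
    check: Callable[[int], bool],
    step_weight: int = 10,
    max_weight: int = 50,
    threshold: int = 5,
) -> Tuple[str, List[int]]:
    """Advance canary traffic while checks pass; roll back after too many failures."""
    weight, failures = step_weight, 0
    history = [weight]
    while True:
        if check(weight):
            if weight >= max_weight:
                return "promoted", history
            weight = min(weight + step_weight, max_weight)
            history.append(weight)
        else:
            failures += 1
            if failures >= threshold:
                return "rolled back", history


if __name__ == "__main__":
    print(run_canary(lambda w: True))     # healthy at every step
    print(run_canary(lambda w: w < 30))   # starts failing at 30% traffic
```

With the values from the manifest, a healthy canary passes through 10, 20, 30, 40, and 50 percent before promotion, so a successful rollout takes at least five one-minute intervals.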

#### Canary metrics to watch

The four most important metrics for canary analysis (essentially the four golden signals from Google's SRE book): **Error Rate**, the 5xx/4xx rate compared with stable; **Latency P99**, which must match or beat stable; **Throughput**, where the canary handles equivalent traffic per instance; and **Saturation**, meaning no abnormal CPU/memory spikes. If any of them breaches its threshold, roll back automatically.
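
A comparison of this kind can be expressed as a small function that returns the list of breached limits. The thresholds and field names below are illustrative assumptions, not recommended values; in practice they would come from the service's own SLOs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metrics:
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    cpu_utilization: float   # fraction of the CPU limit, 0.0 to 1.0


def canary_violations(
    stable: Metrics,
    canary: Metrics,
    max_error_delta: float = 0.01,   # hypothetical: at most +1 point of errors
    max_latency_ratio: float = 1.2,  # hypothetical: at most 20% slower than stable
    max_cpu: float = 0.9,            # hypothetical saturation ceiling
) -> List[str]:
    """Return the names of the checks the canary fails relative to stable."""
    violations = []
    if canary.error_rate - stable.error_rate > max_error_delta:
        violations.append("error_rate")
    if canary.p99_latency_ms > stable.p99_latency_ms * max_latency_ratio:
        violations.append("p99_latency")
    if canary.cpu_utilization > max_cpu:
        violations.append("saturation")
    return violations


if __name__ == "__main__":
    stable = Metrics(0.002, 180, 0.55)
    print(canary_violations(stable, Metrics(0.003, 190, 0.60)))
    print(canary_violations(stable, Metrics(0.050, 400, 0.95)))
```

An empty list means the canary may proceed to the next step; any entry is grounds for rollback. Comparing against the stable version rather than fixed numbers keeps the check meaningful when overall load shifts during the rollout.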

#### ✅ Pros

- Low risk — only a small % of users affected
- Catch production bugs early under real traffic
- Auto-rollback driven by metrics
- Fits large-scale systems

#### ⚠️ Cons

- Requires a service mesh or advanced load balancer
- Setting up the monitoring pipeline is complex
- Deploys are slower (you wait at each step)
- Two versions live concurrently — DB migrations must be compatible

## 5. A/B Testing Deployment — Deploy by user segment

### How it works

Unlike Canary (random traffic split), A/B Testing Deployment routes based on **user attributes**: location, device, user ID, subscription tier, and so on. The goal is both safe deployment and measuring the impact of the new feature.

```yaml
# Istio VirtualService with header-based routing
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: web-api
spec:
  hosts:
  - web-api.example.com
  http:
  # Route the internal team to the new version
  - match:
    - headers:
        x-user-tier:
          exact: "internal"
    route:
    - destination:
        host: web-api
        subset: v2
  # Route premium users
  - match:
    - headers:
        x-user-tier:
          exact: "premium"
    route:
    - destination:
        host: web-api
        subset: v2
  # Everyone else → stable
  - route:
    - destination:
        host: web-api
        subset: v1
```
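
The routing rules above are evaluated top to bottom and the first match wins. A small Python sketch of the same decision (the `pick_subset` function and `V2_TIERS` constant are hypothetical names used only for illustration) makes the behavior easy to reason about:

```python
from typing import Mapping

# Tiers that the rules above send to the new version.
V2_TIERS = ("internal", "premium")


def pick_subset(headers: Mapping[str, str]) -> str:
    """Return the subset ("v1" or "v2") a request would be routed to."""
    # HTTP header names are case-insensitive, so normalize them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    tier = normalized.get("x-user-tier")
    # `exact` means the value must match character for character.
    if tier in V2_TIERS:
        return "v2"
    # Catch-all rule: everyone else stays on stable.
    return "v1"
```

Note that a value such as `Premium` falls through to `v1`, because exact matching is case-sensitive. The `x-user-tier` header should also be set by a trusted component (for example the gateway, after authentication); if clients can send it themselves, anyone can opt into v2.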

## 6. Side-by-side comparison

| Criterion | Rolling Update | Blue-Green | Canary Release | A/B Testing |
| --- | --- | --- | --- | --- |
| **Setup complexity** | ⭐ Low | ⭐⭐ Medium | ⭐⭐⭐ High | ⭐⭐⭐ High |
| **Rollback speed** | Slow (minutes) | Instant (seconds) | Fast (seconds) | Fast (seconds) |
| **Infrastructure cost** | Lowest | Double | +10-50% | +10-50% |
| **Blast radius on failure** | ~25-50% users | 100% or 0% | 1-10% users | Specific segment |
| **Mixed versions** | Yes (temporarily) | No | Yes | Yes |
| **Needs Service Mesh** | No | No | Recommended | Required |
| **Best fit for** | Most applications | Critical systems, infrequent deploys | Large-scale, frequent deploys | Feature experiments |

## 7. Database migrations in a zero-downtime world

This is the hardest part. When two versions run concurrently (Rolling, Canary), both v1 and v2 must be compatible with the same database schema. The golden rule: **the Expand-and-Contract pattern**.

```mermaid
graph TB
    subgraph "Phase 1: Expand"
        E1[Add new column<br/>nullable/default] --> E2[Deploy v2 code]
        E2 --> E3[v1 and v2 both work<br/>v2 starts writing the new column]
    end

    subgraph "Phase 2: Migrate"
        M1[Backfill data<br/>for the new column] --> M2[v2 reads/writes the new column]
        M2 --> M3[Retire v1 completely]
    end

    subgraph "Phase 3: Contract"
        C1[Drop the old column] --> C2[Remove legacy code]
        C2 --> C3[Schema clean]
    end

    E3 --> M1
    M3 --> C1

    style E1 fill:#2196F3,stroke:#fff,color:#fff
    style E2 fill:#2196F3,stroke:#fff,color:#fff
    style E3 fill:#2196F3,stroke:#fff,color:#fff
    style M1 fill:#4CAF50,stroke:#fff,color:#fff
    style M2 fill:#4CAF50,stroke:#fff,color:#fff
    style M3 fill:#4CAF50,stroke:#fff,color:#fff
    style C1 fill:#e94560,stroke:#fff,color:#fff
    style C2 fill:#e94560,stroke:#fff,color:#fff
    style C3 fill:#e94560,stroke:#fff,color:#fff
```

Expand-and-Contract pattern for safe database migrations

### Example: renaming a column

Want to rename `user_name` to `display_name`? You can't just run `ALTER TABLE ... RENAME COLUMN`: every v1 instance still queries `user_name` and will start failing the moment the column is renamed. Do it this way instead:

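```sql
-- MySQL syntax. Phase 1: Expand (run before deploying the v2 code)
ALTER TABLE users ADD COLUMN display_name VARCHAR(255);

-- Triggers keep the two columns in sync during the transition.
-- MySQL needs one trigger per event; in the mysql client, wrap the
-- CREATE TRIGGER statements in DELIMITER // ... DELIMITER ;
CREATE TRIGGER sync_display_name_ins
BEFORE INSERT ON users
FOR EACH ROW
BEGIN
    IF NEW.display_name IS NULL THEN
        SET NEW.display_name = NEW.user_name;
    END IF;
    IF NEW.user_name IS NULL THEN
        SET NEW.user_name = NEW.display_name;
    END IF;
END;

CREATE TRIGGER sync_display_name_upd
BEFORE UPDATE ON users
FOR EACH ROW
BEGIN
    -- Copy whichever column the writer changed
    -- (v1 writes user_name, v2 writes display_name); <=> is NULL-safe equality
    IF NOT (NEW.user_name <=> OLD.user_name) THEN
        SET NEW.display_name = NEW.user_name;
    ELSEIF NOT (NEW.display_name <=> OLD.display_name) THEN
        SET NEW.user_name = NEW.display_name;
    END IF;
END;

-- Phase 2: Backfill existing data
UPDATE users SET display_name = user_name WHERE display_name IS NULL;

-- Phase 3: After v1 is fully retired.
-- Drop the triggers first: they still reference user_name.
DROP TRIGGER sync_display_name_ins;
DROP TRIGGER sync_display_name_upd;
ALTER TABLE users DROP COLUMN user_name;
```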
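
On a large table, a single backfill `UPDATE` can hold locks for a long time, so a common alternative is to backfill in small batches. The sketch below uses Python's built-in `sqlite3` module and an in-memory table purely for illustration. The `backfill_display_name` helper, the `id` primary key, and the batch size are assumptions, and the exact batching syntax varies by database (MySQL, for instance, allows `UPDATE ... LIMIT`).

```python
import sqlite3


def backfill_display_name(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy user_name into display_name in small batches.

    Each batch is committed as its own short transaction, so concurrent
    writes from v1 and v2 are not blocked for long.
    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
               SET display_name = user_name
             WHERE id IN (
                   SELECT id FROM users
                    WHERE display_name IS NULL
                      AND user_name IS NOT NULL  -- skip rows that can never be filled
                    LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT, display_name TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (user_name) VALUES (?)",
        [(f"user{i}",) for i in range(2500)],
    )
    print(backfill_display_name(conn, batch_size=1000))  # prints 2500
```

The `user_name IS NOT NULL` filter matters: without it, a row where both columns are NULL would be selected on every pass and the loop would never terminate.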

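#### Never in a single deploy

Each phase ships as its own deploy. If you add the new column, switch the code, and drop the old column in one release, any v1 instance still running during the rollout will query a column that no longer exists.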

## 8. Health checks — the foundation under every strategy

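Without good health checks, every deploy strategy is meaningless, because the orchestrator cannot tell when a new instance is safe to receive traffic. Health checks must distinguish three states (liveness, readiness, and startup):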

```csharp
// .NET Minimal API health check example
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

// Register checks before builder.Build().
// AddSqlServer/AddRedis come from the AspNetCore.HealthChecks.* NuGet packages;
// connectionString, redisConnection, and WarmupCheck (an IHealthCheck) are defined elsewhere.
builder.Services.AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy(), tags: new[] { "live" })
    .AddSqlServer(connectionString, tags: new[] { "ready" })
    .AddRedis(redisConnection, tags: new[] { "ready" })
    .AddCheck<WarmupCheck>("warmup", tags: new[] { "startup" });

var app = builder.Build();

app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: is the app running at all?
    Predicate = check => check.Tags.Contains("live")
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: is the app ready to receive traffic?
    Predicate = check => check.Tags.Contains("ready")
});

app.MapHealthChecks("/health/startup", new HealthCheckOptions
{
    // Startup: for slow-starting containers
    Predicate = check => check.Tags.Contains("startup")
});

app.Run();
```

```yaml
# Kubernetes probe configuration
containers:
- name: api
  livenessProbe:
    httpGet:
      path: /health/live
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 15
    failureThreshold: 3      # 3 failures → restart container
  readinessProbe:
    httpGet:
      path: /health/ready
      port: 8080
    initialDelaySeconds: 5
    periodSeconds: 10
    failureThreshold: 2      # 2 failures → remove from Service
  startupProbe:
    httpGet:
      path: /health/startup
      port: 8080
    initialDelaySeconds: 0
    periodSeconds: 5
    failureThreshold: 30     # Allow 150s for warm-up
```
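
The comments in the probe configuration follow from simple arithmetic: a probe trips after roughly `periodSeconds × failureThreshold` seconds of consecutive failures. A small helper (the name `failure_window_seconds` is hypothetical) makes the numbers explicit. It ignores `timeoutSeconds` and scheduling jitter, so treat the results as approximations.

```python
def failure_window_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate seconds of consecutive failures before Kubernetes acts.

    Ignores timeoutSeconds and probe jitter, so this is a rough estimate.
    """
    return period_seconds * failure_threshold


if __name__ == "__main__":
    # Values taken from the probe configuration above.
    print("liveness :", failure_window_seconds(15, 3))   # 45s until restart
    print("readiness:", failure_window_seconds(10, 2))   # 20s until removed from Service
    print("startup  :", failure_window_seconds(5, 30))   # 150s allowed for warm-up
```

In other words, a hung container can keep running for about 45 seconds before it is restarted, and a pod that loses a dependency can keep receiving traffic for about 20 seconds. Tighten `periodSeconds` or `failureThreshold` if those windows are too long for your service.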

## 9. Graceful Shutdown — the step people skip

When Kubernetes sends SIGTERM to a pod, the application needs to:

```mermaid
sequenceDiagram
    participant K8s as Kubernetes
    participant Pod as Application Pod
    participant LB as Service/Endpoint

    K8s->>Pod: SIGTERM
    K8s->>LB: Remove Pod from Endpoints
    Note over Pod: Start graceful shutdown
    Pod->>Pod: Stop accepting NEW requests
    Pod->>Pod: Finish in-flight requests (max 30s)
    Pod->>Pod: Close DB connections
    Pod->>Pod: Flush logs/metrics
    Pod-->>K8s: Exit 0

    Note over K8s: If it doesn't exit within<br/>terminationGracePeriodSeconds<br/>→ SIGKILL
```

Graceful shutdown sequence on Kubernetes

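```csharp
// .NET graceful shutdown
using Serilog;

var builder = WebApplication.CreateBuilder(args);

// Give in-flight requests up to 30s to finish after SIGTERM.
// Keep preStop sleep + this timeout below terminationGracePeriodSeconds.
builder.Services.Configure<HostOptions>(options =>
    options.ShutdownTimeout = TimeSpan.FromSeconds(30));

var app = builder.Build();

var lifetime = app.Services.GetRequiredService<IHostApplicationLifetime>();

lifetime.ApplicationStopping.Register(() =>
{
    // Fires when SIGTERM arrives. The host then stops accepting new
    // connections and drains in-flight requests on its own.
    // The preStop hook's sleep 15 ensures the endpoint has been removed
    Log.Information("Shutting down gracefully...");
});

lifetime.ApplicationStopped.Register(() =>
{
    // Runs after the drain: flush buffered log events before the process exits
    Log.CloseAndFlush();
});

app.Run();
```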
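
For runtimes where you manage the drain yourself, the steps in the sequence diagram can be sketched in Python using only the standard library. `GracefulShutdown` and its method names are hypothetical; a real server would call `try_begin_request` and `end_request` around each request and call `wait_for_drain` before closing connections and flushing logs.

```python
import signal
import threading


class GracefulShutdown:
    """Tracks in-flight requests and refuses new ones once shutdown starts."""

    def __init__(self) -> None:
        self._stopping = threading.Event()
        self._cond = threading.Condition()
        self._in_flight = 0

    def install(self) -> None:
        # Must be called from the main thread; SIGTERM flips the stopping flag.
        signal.signal(signal.SIGTERM, lambda signum, frame: self.begin_shutdown())

    def begin_shutdown(self) -> None:
        self._stopping.set()

    def try_begin_request(self) -> bool:
        """Return False (reject the request) once shutdown has started."""
        with self._cond:
            if self._stopping.is_set():
                return False
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def wait_for_drain(self, timeout: float = 30.0) -> bool:
        """Block until no requests are in flight; False if the timeout hit first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

If `wait_for_drain` returns `False`, the remaining requests did not finish within the timeout, and the process should still exit before Kubernetes escalates to SIGKILL.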

#### preStop hook: why the sleep?

Kubernetes starts terminating the pod and removing it from the Service endpoints **simultaneously**, not sequentially. That means for a few seconds the load balancer can still send traffic to a pod that is already shutting down. Because the preStop hook runs before SIGTERM is delivered to the container, `preStop: sleep 15` holds the signal back long enough for endpoint removal to propagate before the app actually starts shutting down.
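
One detail that is easy to miss: the time spent in the preStop hook counts against `terminationGracePeriodSeconds`, which defaults to 30 seconds. A quick budget check (the `shutdown_slack_seconds` helper is a hypothetical name) shows why the default is often too short once a sleep is added:

```python
def shutdown_slack_seconds(prestop_sleep: int, drain_timeout: int,
                           grace_period: int = 30) -> int:
    """Seconds left before SIGKILL after the preStop sleep and the drain.

    A negative result means the pod is killed before draining can finish.
    """
    return grace_period - (prestop_sleep + drain_timeout)


if __name__ == "__main__":
    # 15s preStop sleep + up to 30s to finish in-flight requests
    print(shutdown_slack_seconds(15, 30))                   # -15 with the default grace period
    print(shutdown_slack_seconds(15, 30, grace_period=60))  # 15 seconds of headroom
```

With a 15-second sleep and a 30-second drain, the default grace period is 15 seconds too short, so `terminationGracePeriodSeconds` would need to be at least 45; setting it to 60 leaves some headroom.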

## 10. Which strategy is right for your team?

Startup / small team (1-5 devs)

**→ Rolling Update**. Simple, effective, default on Kubernetes. Focus on writing good health checks and graceful shutdown. Don't over-engineer before you have real traffic.

Scale-up / critical systems

**→ Blue-Green**. When downtime is unacceptable (fintech, healthcare, large e-commerce). Infrastructure cost goes up, but instant rollback is a trade-off worth making. AWS ECS and Azure Container Apps support it natively.

Enterprise / high traffic (>10K RPM)

**→ Canary Release**. When a small blast radius is the top priority. Invest in a service mesh (Istio/Linkerd) and an observability stack. Flagger + Prometheus automate the entire pipeline.

Product-led / data-driven teams

**→ A/B + Canary combined**. When you need to measure the business impact of every deploy. Feature flags (OpenFeature, LaunchDarkly) combined with Canary give maximum control.

#### Mixing in practice

Most production systems don't use a single strategy. A common pattern: **Rolling Update** for small config changes + **Canary** for major feature releases + **Blue-Green** for infrastructure upgrades (Kubernetes version, database engine). Match the strategy to the *risk level* of each change, rather than applying one formula to everything.

## References

- [Kubernetes Documentation — Rolling Update Deployment](https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment)
- [Flagger — Progressive Delivery with Istio](https://docs.flagger.app/tutorials/istio-progressive-delivery)
- [AWS CodeDeploy — Blue/Green Deployments on ECS](https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-steps-ecs.html)
- [Azure Container Apps — Blue-Green Deployment](https://learn.microsoft.com/en-us/azure/container-apps/blue-green-deployment)
- [Google Cloud — Canary Deployment Strategy](https://cloud.google.com/deploy/docs/deployment-strategies/canary)
- [Martin Fowler — Blue-Green Deployment](https://martinfowler.com/bliki/BlueGreenDeployment.html)
- [DORA — Accelerate State of DevOps Report](https://dora.dev/research/)
