Zero-Downtime Database Migration: Expand-Contract, EF Core and Batch Backfill for Production

Posted on: 4/22/2026 11:16:02 AM

Most developers are accustomed to running dotnet ef database update and then deploying new code. But in production with millions of requests per hour, a single ALTER TABLE ADD COLUMN ... NOT NULL can lock a table for minutes — and every minute of downtime can cost thousands of dollars. This article dives deep into strategies for changing database schemas without downtime, from core patterns to concrete implementations on .NET 10 and SQL Server.

  • 99.99%: uptime requirement = only ~52 minutes of downtime per year
  • ~$5,600: average cost per minute of downtime (SMB)
  • 3 phases: Expand → Migrate → Contract
  • 0 locks: goal — no blocking locks on production

1. Why Database Migration Is the Achilles' Heel of Zero-Downtime Deployment

You've set up a perfect Blue-Green deployment, your CI/CD pipeline runs smoothly, and containers auto-scale. But when deploying a new version that needs to add a column to a 50-million-row table — everything collapses. Here's why:

  • Schema lock: Many DDL statements require a Schema Modification Lock (Sch-M) on SQL Server, blocking all queries running against that table.
  • Version mismatch: New code expects the new column, old code doesn't know it exists. During rolling deployment, two versions run simultaneously → crash or data inconsistency.
  • Rollback complexity: Unlike code that can be rolled back with git revert, once a schema change is applied, the data has been transformed — rollback may lose data.
  • Tight coupling: When migration runs with application startup (auto-migrate), one pod runs migration while 9 other pods serve traffic with the old schema.

Common anti-pattern

Never let EF Core auto-run Database.Migrate() inside Program.cs on production. This means the first pod to start will run the migration while traffic is incoming — creating a race condition between schema changes and request handling.

2. The Expand-Contract Pattern — Foundation for All Zero-Downtime Migrations

Expand-Contract (also known as Parallel Change) is the core pattern: instead of making direct destructive changes, you split the work into 3 separate phases, each backward-compatible.

graph LR
    A["Initial State<br/>Schema V1 + Code V1"] --> B["EXPAND<br/>Schema V2 (additions only)<br/>Code V1 still works"]
    B --> C["MIGRATE<br/>Schema V2<br/>Code V2 (uses old+new)<br/>Backfill data"]
    C --> D["CONTRACT<br/>Schema V3 (remove old)<br/>Code V3 (new only)"]

Three phases of the Expand-Contract Pattern

2.1 Expand Phase — Add Without Breaking

Add new elements (columns, tables, indexes) without removing or modifying anything existing. Old code (V1) continues to work because it doesn't know about the new column — new columns must be NULLABLE or have a DEFAULT.

-- Expand: add new column, nullable, doesn't break old code
ALTER TABLE dbo.Users ADD Email NVARCHAR(256) NULL;

-- Add index ONLINE to avoid blocking reads/writes
CREATE NONCLUSTERED INDEX IX_Users_Email
ON dbo.Users (Email)
WITH (ONLINE = ON);

2.2 Migrate Phase — Transition Gradually

Deploy new code that writes to both old and new columns (dual-write). Simultaneously run a background job to backfill data from old to new for existing rows. When 100% of data has been migrated, new code starts reading from the new column.

// Dual-write in application code
public async Task UpdateUser(int userId, string newEmail)
{
    // Write to both columns during transition
    await _db.ExecuteAsync(@"
        UPDATE Users
        SET Email = @Email,           -- new column
            ContactInfo = @Email      -- old column (backward compat)
        WHERE Id = @UserId",
        new { Email = newEmail, UserId = userId });
}

2.3 Contract Phase — Clean Up

After confirming no code reads/writes the old column (monitor long enough, typically 1–2 sprints), drop the old column and unnecessary constraints.

-- Contract: only run after confirming 100% traffic uses new column
ALTER TABLE dbo.Users DROP COLUMN ContactInfo;

Golden rule

Each deployment should execute only one phase. Deploy 1: Expand (add column). Deploy 2: Migrate (new code + backfill). Deploy 3: Contract (remove old). Never combine Expand + Contract in a single release.

3. Common Migration Scenarios and Solutions

3.1 Adding a New Column

The simplest migration, yet often done wrong. Common mistake: adding a NOT NULL column without a DEFAULT.

| Approach | Downtime? | Explanation |
|---|---|---|
| ADD col INT NOT NULL | Yes — full table lock | SQL Server must scan all rows to verify the constraint |
| ADD col INT NULL | No — metadata-only | Only changes metadata, no data scan |
| ADD col INT NOT NULL DEFAULT 0 | No (SQL Server 2012+) | Default stored in metadata, no immediate backfill |
| ADD col INT NULL → backfill → ALTER NOT NULL | Depends — final step needs Sch-M | Safest for large tables, batch backfill |
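The third approach is worth a concrete sketch (table and column names here are illustrative, and the metadata-only fast path for NOT NULL + DEFAULT depends on your SQL Server version and edition, so verify it on yours first). When the fast path applies, the default value is recorded in metadata and no data pages are touched at ALTER time:

-- Illustrative: add a NOT NULL column with a default.
-- On versions with the metadata-only path, this returns almost
-- instantly even on a very large table.
ALTER TABLE dbo.Users
ADD IsActive BIT NOT NULL
CONSTRAINT DF_Users_IsActive DEFAULT (1);

-- Existing rows report IsActive = 1 immediately; the value is
-- materialized lazily as rows are subsequently written.
SELECT TOP (5) Id, IsActive FROM dbo.Users;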

3.2 Renaming a Column

Never use sp_rename directly on production. Instead, use Expand-Contract:

  1. Expand: Add new column with desired name, trigger to sync data from old to new.
  2. Migrate: New code reads/writes new column. Backfill all existing data.
  3. Contract: Drop trigger and old column.
-- Expand: add new column + sync trigger
ALTER TABLE dbo.Orders ADD CustomerEmail NVARCHAR(256) NULL;

CREATE TRIGGER trg_SyncEmail ON dbo.Orders
AFTER INSERT, UPDATE AS
BEGIN
    SET NOCOUNT ON;
    UPDATE o SET o.CustomerEmail = i.CustEmail
    FROM dbo.Orders o INNER JOIN inserted i ON o.Id = i.Id
    WHERE i.CustEmail IS NOT NULL AND o.CustomerEmail IS NULL;
END;

-- Backfill in batches of 10,000 rows
WHILE 1=1
BEGIN
    UPDATE TOP (10000) dbo.Orders
    SET CustomerEmail = CustEmail
    WHERE CustomerEmail IS NULL AND CustEmail IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;
    WAITFOR DELAY '00:00:01'; -- throttle to reduce load
END;

3.3 Changing Data Type

Example: changing VARCHAR(50) to NVARCHAR(256). This is a dangerous migration because ALTER COLUMN on a large table will rebuild everything.

graph TD
    A["Create new column<br/>NVARCHAR(256) NULL"] --> B["Deploy dual-write code<br/>Write old + new columns"]
    B --> C["Batch backfill<br/>Copy old → new"]
    C --> D{"100% data migrated?"}
    D -- "No" --> C
    D -- "Yes" --> E["Deploy code reading new col"]
    E --> F["Monitor 1-2 weeks"]
    F --> G["Drop old column"]

Flow chart for data type change migration
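In SQL, the expand and backfill boxes of the flow above might look like this sketch (the DisplayName/DisplayNameV2 names are illustrative, standing in for the VARCHAR(50) → NVARCHAR(256) change):

-- Expand: add a parallel column with the target type
ALTER TABLE dbo.Users ADD DisplayNameV2 NVARCHAR(256) NULL;

-- Migrate: batch backfill with an explicit conversion
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.Users
    SET DisplayNameV2 = CONVERT(NVARCHAR(256), DisplayName)
    WHERE DisplayNameV2 IS NULL AND DisplayName IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;
    WAITFOR DELAY '00:00:01'; -- throttle
END;

-- Contract (only after dual-write + monitoring):
-- ALTER TABLE dbo.Users DROP COLUMN DisplayName;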

3.4 Dropping a Column

Sounds simple, but if running code still references that column → crash. The process:

  1. Deploy 1: New code no longer reads/writes that column (but column still exists).
  2. Monitor: Wait 1–2 weeks, verify no queries touch that column via sys.dm_exec_query_stats.
  3. Deploy 2: ALTER TABLE DROP COLUMN.
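The verification in step 2 can be approximated by searching the plan cache for statements that still reference the column. This is an imperfect heuristic — the cache is evicted over time — so combine it with application-side metrics (the ContactInfo filter below is illustrative):

-- Find cached statements that still mention the column being dropped
SELECT TOP (20)
    qs.last_execution_time,
    qs.execution_count,
    t.text AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) t
WHERE t.text LIKE '%ContactInfo%'
ORDER BY qs.last_execution_time DESC;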

3.5 Adding NOT NULL Constraint

Adding a NOT NULL constraint to an existing column requires SQL Server to scan the entire table to verify no NULLs exist. For large tables, this causes prolonged locking.

-- Step 1: Add CHECK constraint WITH NOCHECK (no data scan)
ALTER TABLE dbo.Users WITH NOCHECK
ADD CONSTRAINT CK_Users_Email_NotNull CHECK (Email IS NOT NULL);

-- Step 2: Backfill NULL rows in batches until none remain
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) dbo.Users SET Email = '' WHERE Email IS NULL;
    IF @@ROWCOUNT = 0 BREAK;
    WAITFOR DELAY '00:00:01'; -- throttle
END;

-- Step 3: When data is 100% clean, enable constraint verification
ALTER TABLE dbo.Users WITH CHECK
CHECK CONSTRAINT CK_Users_Email_NotNull;

-- Step 4 (optional): Convert to actual NOT NULL if needed
ALTER TABLE dbo.Users ALTER COLUMN Email NVARCHAR(256) NOT NULL;

4. EF Core Migration Workflow for Production

Entity Framework Core is the most popular ORM on .NET, but its default migration workflow isn't suitable for zero-downtime. The approach needs to change.

4.1 Decouple Migration from Application Startup

// Program.cs — DO NOT do this in production
// app.Services.GetRequiredService<AppDbContext>().Database.Migrate();

// Instead: create a separate migration bundle
// Terminal: dotnet ef migrations bundle --self-contained -o migrate.exe
// CI/CD: run migrate.exe BEFORE deploying new code

Best practice is to create a dedicated Kubernetes Job or Azure Container Instance solely for running migrations:

# k8s-migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration-v42
spec:
  template:
    spec:
      containers:
      - name: migrator
        image: myapp:v42-migrator
        command: ["./migrate"]
        env:
        - name: ConnectionStrings__Default
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: connection-string
      restartPolicy: Never
  backoffLimit: 1

4.2 Writing Safe Migrations with EF Core

EF Core generates migrations based on the diff between model snapshots. But the generated code is often not zero-downtime safe. It needs review and correction:

// Auto-generated migration — NOT safe
public partial class AddUserEmail : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // EF auto-generates NOT NULL — will lock table
        migrationBuilder.AddColumn<string>(
            name: "Email",
            table: "Users",
            type: "nvarchar(256)",
            nullable: false,      // <-- problem here
            defaultValue: "");
    }
}

// Fixed safe version
public partial class AddUserEmail : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // Step 1: Add as nullable first
        migrationBuilder.AddColumn<string>(
            name: "Email",
            table: "Users",
            type: "nvarchar(256)",
            nullable: true);       // nullable = safe

        // Step 2: Backfill will run in a separate migration later
    }
}

4.3 Tooling: EF Core Migration Analyzers

Use tooling such as EF Core Power Tools (by ErikEJ) or custom Roslyn analyzers to automatically flag unsafe migrations:

// Roslyn analyzer checks whether a migration is safe
// Detects: AddColumn with nullable: false without defaultValue
// Detects: DropColumn, DropTable in same migration as AddColumn
// Detects: AlterColumn changing data type

Migration Review Strategy

Every PR with a migration file must answer 3 questions: (1) Will old code crash when the new schema is applied? (2) Will new code crash if the old schema is still running? (3) Will rolling back the schema lose data? If any answer is "yes" → split into multiple deployments.

5. SQL Server — Online Operations and Lock Management

SQL Server has many features supporting online schema changes, but you need to understand which version supports what.

5.1 Online Index Operations

| Operation | Online Support | Required Edition |
|---|---|---|
| CREATE INDEX | WITH (ONLINE = ON) | Enterprise / Developer |
| ALTER INDEX REBUILD | WITH (ONLINE = ON) | Enterprise / Developer |
| ALTER COLUMN (data type) | WITH (ONLINE = ON) | SQL Server 2016+ Enterprise |
| ADD COLUMN (nullable) | Automatically online | All editions |
| ADD COLUMN (NOT NULL + DEFAULT) | Metadata-only (2012+) | All editions from 2012 |
| DROP COLUMN | Automatically online | All editions |
| ADD FOREIGN KEY | WITH NOCHECK to avoid scan | All editions |
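The foreign-key row deserves a concrete sketch: WITH NOCHECK skips validating existing rows, so the statement needs only a brief schema lock. The trade-off is that the constraint stays marked untrusted until you validate it, which can hurt query plans (constraint and column names below are illustrative):

-- Add the FK without scanning existing rows
ALTER TABLE dbo.Orders WITH NOCHECK
ADD CONSTRAINT FK_Orders_Users FOREIGN KEY (UserId)
REFERENCES dbo.Users (Id);

-- Later, during a quiet window, validate so the optimizer can trust it
ALTER TABLE dbo.Orders WITH CHECK
CHECK CONSTRAINT FK_Orders_Users;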

5.2 Lock Escalation and Control

SQL Server automatically escalates from row lock → page lock → table lock when the lock count exceeds thresholds. With large tables, a batch update can trigger a table lock.

-- Control lock escalation during backfill
ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = DISABLE);

-- Backfill in small batches, each in its own transaction
DECLARE @BatchSize INT = 5000;
DECLARE @RowsAffected INT = 1;

WHILE @RowsAffected > 0
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (@BatchSize) dbo.Orders
    SET CustomerEmail = CustEmail
    WHERE CustomerEmail IS NULL AND CustEmail IS NOT NULL;

    SET @RowsAffected = @@ROWCOUNT;

    COMMIT TRANSACTION;

    -- Pause between batches to ease I/O pressure
    IF @RowsAffected > 0
        WAITFOR DELAY '00:00:02';
END;

-- Re-enable lock escalation
ALTER TABLE dbo.Orders SET (LOCK_ESCALATION = TABLE);

5.3 Monitoring Locks During Migration

-- Check active locks
SELECT
    r.session_id,
    r.blocking_session_id,
    r.wait_type,
    r.wait_time,
    t.text AS query_text,
    r.status
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE r.blocking_session_id > 0;

-- Check schema locks
SELECT
    resource_type,
    request_mode,
    request_status,
    request_session_id
FROM sys.dm_tran_locks
WHERE resource_type = 'OBJECT'
AND request_mode IN ('Sch-M', 'Sch-S');

6. Online Schema Change Tools — When Native DDL Isn't Enough

For MySQL, the two most popular tools are gh-ost (GitHub) and pt-online-schema-change (Percona). They apply the Expand-Contract principle at the tooling level.

graph TB
    subgraph "gh-ost (GitHub)"
        G1["1. Create ghost table<br/>copy original schema"] --> G2["2. Apply ALTER<br/>on ghost table"]
        G2 --> G3["3. Stream binlog<br/>sync data realtime"]
        G3 --> G4["4. Cut-over<br/>rename tables (atomic)"]
    end
    subgraph "pt-online-schema-change (Percona)"
        P1["1. Create new table<br/>with new schema"] --> P2["2. Attach triggers<br/>sync INSERT/UPDATE/DELETE"]
        P2 --> P3["3. Copy data in batches<br/>from original table"]
        P3 --> P4["4. Swap tables<br/>atomic rename"]
    end

Comparing gh-ost vs pt-online-schema-change mechanisms

| Criteria | gh-ost | pt-online-schema-change |
|---|---|---|
| Sync mechanism | Binary log streaming | Triggers on original table |
| Production impact | Low — no trigger overhead | Medium — triggers add write latency |
| Foreign keys | Not supported | Supported (with limitations) |
| Throttling | Automatic based on replication lag | Manual chunk size configuration |
| Rollback | Drop ghost table | Drop new table + remove triggers |
| Best for | High-traffic, modern infra | Legacy, with FK constraints |

What about SQL Server?

SQL Server doesn't have direct equivalents to gh-ost/pt-osc. Instead, use a combination of: WITH (ONLINE = ON) for index operations, metadata-only ADD COLUMN, and batch updates for backfill. For major changes, consider the Blue-Green Database pattern: maintain two databases, sync via CDC or replication.

7. Blue-Green Database Pattern

When schema changes are too complex for standard Expand-Contract (e.g., complete table restructuring, merging/splitting tables), Blue-Green Database is the last resort.

graph LR
    LB["Load Balancer"] --> Blue["BLUE (Production)<br/>Schema V1<br/>Serving traffic"]
    LB -.-> Green["GREEN (Staging)<br/>Schema V2<br/>Preparing"]
    Blue -- "CDC / Replication" --> Green
    Green --> Switch{"Ready?"}
    Switch -- "Yes" --> LB2["Load Balancer<br/>Switch traffic → Green"]
    Switch -- "No" --> Green

Blue-Green Database: schema transition via two parallel databases

Process:

  1. Create Green database from Blue backup.
  2. Apply schema migration to Green (can lock freely since Green isn't serving traffic).
  3. Set up CDC (Change Data Capture) or replication from Blue → Green to sync data in realtime.
  4. Test Green database with smoke tests and load tests.
  5. Cut-over: switch connection string from Blue to Green. Downtime is measured in milliseconds.
  6. Blue becomes the rollback target for a few hours after cut-over.
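Step 3's CDC setup on the Blue database can be sketched as follows (schema and table names are illustrative; a downstream process then reads the captured changes and applies them to Green, transforming them to the new schema where needed):

-- Enable CDC at the database level (requires SQL Server Agent)
EXEC sys.sp_cdc_enable_db;

-- Track changes on the source table
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;   -- no gating role

-- Consumers read changes via the generated function, e.g.
-- cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, 'all')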

Blue-Green Database limitations

Double the cost (maintaining 2 databases). CDC/Replication is complex, especially when schema has changed. Not suitable for small migrations — only use when Expand-Contract isn't feasible.

8. Safe Data Backfill — The Art of Batch Processing

Backfill is the most dangerous step in migration because it touches real data. A single UPDATE on 50 million rows can:

  • Blow up the transaction log, since the entire statement runs as one transaction — the log file can grow by tens of GB.
  • Trigger lock escalation → block all queries on the table.
  • Cause replication lag to spike if replicas exist.

8.1 Pattern: Batched Backfill with Throttling

public class BackfillService : BackgroundService
{
    private readonly IDbConnectionFactory _db;
    private readonly ILogger<BackfillService> _logger;

    protected override async Task ExecuteAsync(CancellationToken ct)
    {
        const int batchSize = 5_000;
        const int delayMs = 2_000;
        long totalUpdated = 0;

        while (!ct.IsCancellationRequested)
        {
            var affected = await _db.ExecuteAsync(@"
                UPDATE TOP (@Batch) dbo.Users
                SET EmailNormalized = UPPER(Email)
                WHERE EmailNormalized IS NULL AND Email IS NOT NULL",
                new { Batch = batchSize });

            totalUpdated += affected;
            _logger.LogInformation(
                "Backfill progress: {Total} rows updated", totalUpdated);

            if (affected == 0)
            {
                _logger.LogInformation("Backfill complete!");
                break;
            }

            // Throttle: pause between batches to let the DB breathe
            await Task.Delay(delayMs, ct);
        }
    }
}

8.2 Monitor Backfill Progress

-- Check backfill progress
SELECT
    COUNT(*) AS TotalRows,
    SUM(CASE WHEN EmailNormalized IS NOT NULL THEN 1 ELSE 0 END) AS Migrated,
    SUM(CASE WHEN EmailNormalized IS NULL AND Email IS NOT NULL THEN 1 ELSE 0 END) AS Pending,
    CAST(
        SUM(CASE WHEN EmailNormalized IS NOT NULL THEN 1.0 ELSE 0 END)
        / COUNT(*) * 100 AS DECIMAL(5,2)
    ) AS PercentComplete
FROM dbo.Users;

9. Rollback Strategy — Preparing for When Things Go Wrong

Unlike code deployment where you can rollback by deploying the previous version, database migration rollback is far more complex because data has already changed.

9.1 Forward-Only Migration

Philosophy: don't rollback migrations, only roll forward. If a migration causes issues, create a new migration to fix it rather than reverting. Reasons:

  • If you ADD COLUMN and data has been written to it, DROP COLUMN = data loss.
  • If you backfill data then rollback the schema, the data transformation has already occurred.
  • Forward-only is simpler and less risky than reverse migrations.

9.2 When You Need a Rollback Plan

| Migration type | Rollback possible? | Strategy |
|---|---|---|
| Add nullable column | Yes — drop column | Safe since existing data isn't affected |
| Rename column | Yes — rename back | But deployed code needs rollback too |
| Change data type (widen) | Complex | Data already written in new format, needs reverse transform |
| Drop column | No | Data is gone. Need to restore from backup |
| Merge tables | Very complex | Use Blue-Green Database for instant rollback |

Checkpoint before every migration

Always create a database snapshot (SQL Server) or point-in-time backup before running migrations. Not for regular rollbacks, but as insurance when everything truly breaks.
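A database snapshot is cheap to create (a sparse file updated copy-on-write). A sketch, noting that the logical name and file path below are illustrative and must match your database's actual logical data file name:

-- Create a pre-migration snapshot
CREATE DATABASE AppDb_PreMigration
ON (NAME = AppDb_Data, FILENAME = 'D:\Snapshots\AppDb_PreMigration.ss')
AS SNAPSHOT OF AppDb;

-- Worst case: revert the whole database to the snapshot
-- RESTORE DATABASE AppDb FROM DATABASE_SNAPSHOT = 'AppDb_PreMigration';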

10. CI/CD Pipeline for Database Migration

Migration must be a separate step in the pipeline, running before application deployment.

graph LR
    A["Code Push"] --> B["Build + Test"]
    B --> C["Generate<br/>Migration Bundle"]
    C --> D["Review<br/>Migration SQL"]
    D --> E["Apply Migration<br/>(K8s Job)"]
    E --> F["Verify Schema"]
    F --> G["Deploy App<br/>(Rolling Update)"]
    G --> H["Smoke Test"]
    H --> I["Monitor<br/>30 minutes"]

CI/CD Pipeline with separate Database Migration step

# GitHub Actions workflow
name: Deploy with Migration

on:
  push:
    branches: [main]

jobs:
  migrate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup .NET 10
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Build Migration Bundle
        run: |
          dotnet tool restore
          dotnet ef migrations bundle \
            --project src/MyApp.Data \
            --startup-project src/MyApp.Api \
            --self-contained \
            -o ./migrate

      - name: Generate Migration SQL (for review)
        run: |
          dotnet ef migrations script \
            --project src/MyApp.Data \
            --startup-project src/MyApp.Api \
            --idempotent \
            -o migration.sql

      - name: Apply Migration
        run: ./migrate --connection "${{ secrets.DB_CONNECTION }}"

      - name: Verify Applied Migrations
        run: |
          dotnet ef migrations list \
            --project src/MyApp.Data \
            --startup-project src/MyApp.Api \
            --connection "${{ secrets.DB_CONNECTION }}"

  deploy:
    needs: migrate
    runs-on: ubuntu-latest
    steps:
      - name: Deploy to Kubernetes
        run: kubectl rollout restart deployment/myapp

11. Safe Migration Checklist

Before every production migration, go through this checklist:

| # | Check | Action if failed |
|---|---|---|
| 1 | Can old code (N-1) run with the new schema? | Split migration, apply Expand-Contract |
| 2 | Can new code (N) run with the old schema? | Deploy code first, migration after |
| 3 | Does migration require a table lock? | Use ONLINE option or batch approach |
| 4 | How many rows in the table? | >1M rows: mandatory batch backfill |
| 5 | Is there a backup/snapshot before migration? | Create snapshot before running |
| 6 | Is migration idempotent? | Add IF EXISTS/IF NOT EXISTS checks |
| 7 | Is rollback plan documented? | Write rollback script before applying |
| 8 | Tested on staging with equivalent data volume? | Clone production data (anonymized) to staging |
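The idempotency check in practice: guard every DDL statement so the script can be rerun safely after a partial failure. For example:

-- Safe to run any number of times
IF NOT EXISTS (
    SELECT 1 FROM sys.columns
    WHERE object_id = OBJECT_ID(N'dbo.Users') AND name = N'Email'
)
BEGIN
    ALTER TABLE dbo.Users ADD Email NVARCHAR(256) NULL;
END;

EF Core's dotnet ef migrations script --idempotent produces the same effect by wrapping each migration in a check against the __EFMigrationsHistory table.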

12. Conclusion

Zero-downtime database migration isn't magic — it's engineering discipline. Core principles:

  • Expand-Contract is the foundational pattern: add first, transition gradually, remove later.
  • Decouple migration from deployment: Migration runs separately, code deploys separately.
  • Backward compatibility: Each step must be backward-compatible with the previous version.
  • Batch everything: Never update millions of rows in a single transaction.
  • Monitor before, during, and after: Lock wait time, replication lag, error rate.
  • Forward-only mindset: Prefer rolling forward over rolling back.

Where to start?

If your team has never done zero-downtime migration: start by decoupling migration from Program.cs. This single step alone eliminates 80% of risk. Then gradually adopt Expand-Contract for complex changes.
