Durable Execution: Building Crash-Proof Workflows in Distributed Systems

Posted on: 4/22/2026 11:10:40 PM

The Problem with Traditional Workflows

Imagine you're building an e-commerce order processing system. The flow includes: verify payment → deduct inventory → send confirmation email → call shipping API → update status. What happens if the server crashes in the middle of step 3?

With the traditional approach, you must manage state yourself: save state to a database after each step, write manual retry logic, handle idempotency, and build cron jobs to "sweep" stuck orders. Your 50-line business logic suddenly balloons into 500 lines of infrastructure code.

What is Durable Execution?

Durable Execution is a model that lets you write straightforward sequential code, while the platform guarantees that code will run to completion — even if servers crash, networks time out, or deployments happen mid-execution. State is automatically persisted and restored without the developer writing a single line of storage code.

0 lines of manual state management
100% guaranteed run-to-completion
2,000+ companies using Temporal in production
Seconds → years: possible workflow duration

How It Works: Event History and Replay

At the heart of Durable Execution lies the Event History — an immutable, append-only log recording every event in a workflow. When a worker crashes, the platform replays the event history on a new worker, reconstructing the entire state without re-executing side effects.

sequenceDiagram
    participant W as Worker
    participant S as Server/Scheduler
    participant DB as Event Store

    W->>S: Start Workflow
    S->>DB: Write WorkflowStarted
    W->>S: Activity: Verify payment ✓
    S->>DB: Write ActivityCompleted(payment)
    W->>S: Activity: Deduct inventory ✓
    S->>DB: Write ActivityCompleted(inventory)
    Note over W: 💥 Worker CRASH!
    S-->>W: New worker assigned
    S->>DB: Read Event History
    DB-->>S: [Started, Payment✓, Inventory✓]
    S-->>W: Replay → skip payment, skip inventory
    W->>S: Activity: Send email (resume from step 3)
    S->>DB: Write ActivityCompleted(email)

Replay mechanism: new worker reads event history, skips completed activities, resumes from the break point
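
To make the replay idea concrete, here is a minimal sketch of a replay loop. It is not how any particular engine is implemented; ReplayContext, HistoryEvent, and the step names are hypothetical, and activities are synchronous to keep the example short.

```typescript
// Hypothetical sketch of replay; none of these names come from a real SDK.
type HistoryEvent = { step: string; result: unknown };

class ReplayContext {
  private cursor = 0;
  private history: HistoryEvent[];
  // Names of activities that actually ran (and so caused side effects).
  executed: string[] = [];

  constructor(history: HistoryEvent[]) {
    this.history = history;
  }

  // Return the recorded result when replaying; otherwise run and record.
  activity<T>(step: string, fn: () => T): T {
    const recorded = this.history[this.cursor];
    if (recorded !== undefined) {
      if (recorded.step !== step) {
        throw new Error(`Non-deterministic replay: expected ${recorded.step}, got ${step}`);
      }
      this.cursor++;
      return recorded.result as T; // side effect is skipped
    }
    const result = fn(); // first execution: perform the side effect
    this.history.push({ step, result });
    this.cursor++;
    this.executed.push(step);
    return result;
  }
}

function orderWorkflow(ctx: ReplayContext): string {
  ctx.activity("verify-payment", () => true);
  ctx.activity("deduct-inventory", () => true);
  ctx.activity("send-email", () => true);
  return "done";
}

// History left behind by a worker that crashed after step 2.
const historyAfterCrash: HistoryEvent[] = [
  { step: "verify-payment", result: true },
  { step: "deduct-inventory", result: true },
];
const recovered = new ReplayContext(historyAfterCrash);
const outcome = orderWorkflow(recovered);
// recovered.executed is ["send-email"]: only the unfinished step ran.
```

The workflow function runs from the top both times. What changes is whether each step is served from the log or actually executed.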

The Deterministic Constraint

The most critical concept to understand: workflow code must be deterministic. During replay, the platform re-executes the workflow code from the beginning, but instead of actually running activities, it matches them against the event history. If code is non-deterministic (e.g., using DateTime.Now or Random directly), replay produces different results and the workflow fails.

⚠️ What NOT to use in workflow code

Forbidden: DateTime.Now, Random, Thread.Sleep, direct API/DB calls, file I/O, mutable environment variables.
Alternatives: Use the platform's deterministic APIs. In the Temporal .NET SDK these are Workflow.UtcNow, Workflow.Random, and Workflow.DelayAsync. All side effects must live inside Activities.
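
A small sketch may help show why a direct clock read breaks replay while a platform-provided one does not. DeterministicClock and chooseShipping are hypothetical names for illustration. The idea is that the first run records each clock read, and replay returns the recorded value instead of the wall clock.

```typescript
// Hypothetical sketch: clock reads are recorded so replay sees the same values.
class DeterministicClock {
  private cursor = 0;
  private recorded: number[];
  private wallClock: () => number;

  constructor(recorded: number[], wallClock: () => number) {
    this.recorded = recorded;
    this.wallClock = wallClock;
  }

  currentTime(): number {
    const replayed = this.recorded[this.cursor];
    if (replayed !== undefined) {
      this.cursor++;
      return replayed; // replay: ignore the wall clock
    }
    const value = this.wallClock(); // first run: read and record
    this.recorded.push(value);
    this.cursor++;
    return value;
  }
}

// Workflow logic that branches on the time of day.
function chooseShipping(readClock: () => number): string {
  const hour = new Date(readClock()).getUTCHours();
  return hour < 12 ? "morning-batch" : "evening-batch";
}

const clockLog: number[] = [];
const firstRun = new DeterministicClock(clockLog, () => Date.UTC(2026, 0, 1, 9));
const original = chooseShipping(() => firstRun.currentTime());

// The worker crashes and replay happens six hours later.
const lateWallClock = () => Date.UTC(2026, 0, 1, 15);
const replayClock = new DeterministicClock(clockLog, lateWallClock);
const replayed = chooseShipping(() => replayClock.currentTime());
const naive = chooseShipping(lateWallClock);
// original and replayed are both "morning-batch"; naive is "evening-batch".
```

The naive version takes a different branch on replay, which is the kind of divergence the engine reports as a non-determinism failure.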

Temporal — Architecture and Real-World Code

Temporal is the most widely adopted durable execution engine today, used in production by Netflix, DoorDash, Stripe, and Snap. It's open-source (MIT license) with a managed cloud option.

Temporal Architecture

graph TB
    subgraph Client["Client Application"]
        A["Temporal Client<br/>SDK"]
    end

    subgraph TS["Temporal Server Cluster"]
        F["Frontend Service<br/>API Gateway"]
        H["History Service<br/>Event Storage + Replay"]
        M["Matching Service<br/>Task Queue Dispatch"]
        W2["Internal Worker"]
    end

    subgraph Workers["Worker Fleet"]
        W1["Worker 1<br/>Workflow + Activity"]
        W3["Worker 2<br/>Workflow + Activity"]
        W4["Worker N<br/>Workflow + Activity"]
    end

    subgraph Storage["Persistence"]
        DB2["Database<br/>PostgreSQL / MySQL / Cassandra"]
        ES["Elasticsearch<br/>Visibility"]
    end

    A -->|"StartWorkflow<br/>Signal/Query"| F
    F --> H
    F --> M
    H --> DB2
    M -->|"Dispatch Task"| W1
    M -->|"Dispatch Task"| W3
    M -->|"Dispatch Task"| W4
    H --> ES

    style A fill:#e94560,stroke:#fff,color:#fff
    style F fill:#2c3e50,stroke:#fff,color:#fff
    style H fill:#2c3e50,stroke:#fff,color:#fff
    style M fill:#2c3e50,stroke:#fff,color:#fff
    style W1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style W3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style W4 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style DB2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style ES fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Temporal Server architecture: Frontend receives requests, History manages events, Matching dispatches tasks to Workers

Code Example: Order Processing Workflow

Here's an order processing workflow using the Temporal .NET SDK:

// Workflow definition
[Workflow]
public class OrderWorkflow
{
    [WorkflowRun]
    public async Task<OrderResult> RunAsync(Order order)
    {
        // Step 1: Verify payment
        var paymentResult = await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.VerifyPaymentAsync(order.PaymentInfo),
            new ActivityOptions { StartToCloseTimeout = TimeSpan.FromSeconds(30) });

        if (!paymentResult.Success)
            return OrderResult.Failed("Payment verification failed");

        // Step 2: Reserve inventory (retried up to 3 times; compensation for later failures is not shown here)
        await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.ReserveInventoryAsync(order.Items),
            new ActivityOptions
            {
                StartToCloseTimeout = TimeSpan.FromMinutes(1),
                RetryPolicy = new RetryPolicy { MaximumAttempts = 3 }
            });

        // Step 3: Send confirmation email
        await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.SendConfirmationEmailAsync(order),
            new ActivityOptions { StartToCloseTimeout = TimeSpan.FromSeconds(15) });

        // Step 4: Create shipment (each attempt must finish within 5 minutes)
        var trackingId = await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.CreateShipmentAsync(order),
            new ActivityOptions
            {
                StartToCloseTimeout = TimeSpan.FromMinutes(5),
                RetryPolicy = new RetryPolicy
                {
                    MaximumAttempts = 5,
                    InitialInterval = TimeSpan.FromSeconds(10),
                    BackoffCoefficient = 2.0
                }
            });

        // Step 5: Wait for delivery confirmation (may take days)
        var delivered = await Workflow.WaitConditionAsync(
            () => _deliveryConfirmed,
            timeout: TimeSpan.FromDays(14));

        return delivered
            ? OrderResult.Completed(trackingId)
            : OrderResult.DeliveryTimeout(trackingId);
    }

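    private bool _deliveryConfirmed;

    // Backing field for the GetStatus query
    private string _currentStatus = "Processing";

    [WorkflowSignal]
    public async Task ConfirmDeliveryAsync()
    {
        _deliveryConfirmed = true;
        _currentStatus = "Delivered";
    }

    [WorkflowQuery]
    public string GetStatus() => _currentStatus;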
}

// Activity implementation: where side effects live.
// [Activity] marks individual methods, not the class.
public class OrderActivities
{
    private readonly IPaymentGateway _payment;
    private readonly IInventoryService _inventory;
    private readonly IEmailService _email;

    public OrderActivities(
        IPaymentGateway payment,
        IInventoryService inventory,
        IEmailService email)
    {
        _payment = payment;
        _inventory = inventory;
        _email = email;
    }

    [Activity]
    public async Task<PaymentResult> VerifyPaymentAsync(PaymentInfo info)
        => await _payment.ChargeAsync(info);

    [Activity]
    public async Task ReserveInventoryAsync(List<OrderItem> items)
        => await _inventory.ReserveAsync(items);

    [Activity]
    public async Task SendConfirmationEmailAsync(Order order)
        => await _email.SendOrderConfirmationAsync(order.CustomerEmail, order);

    [Activity]
    public async Task<string> CreateShipmentAsync(Order order)
        => await _inventory.CreateShipmentAsync(order.ShippingAddress, order.Items);
}
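
The retry policy on CreateShipmentAsync above (5 attempts, 10-second initial interval, backoff coefficient 2.0) implies a specific wait schedule. The sketch below works it out using the usual exponential backoff formula. The RetryPolicy shape and retryDelays helper are hypothetical, and the cap of 100 times the initial interval is an assumption about the default maximum interval.

```typescript
// Hypothetical helper that computes the wait before each retry.
interface RetryPolicy {
  maximumAttempts: number;
  initialIntervalSeconds: number;
  backoffCoefficient: number;
  maximumIntervalSeconds?: number; // assumed default: 100 x initial interval
}

function retryDelays(policy: RetryPolicy): number[] {
  const cap = policy.maximumIntervalSeconds ?? policy.initialIntervalSeconds * 100;
  const delays: number[] = [];
  // The first attempt runs immediately, so N attempts means N - 1 waits.
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const raw = policy.initialIntervalSeconds * Math.pow(policy.backoffCoefficient, retry);
    delays.push(Math.min(raw, cap));
  }
  return delays;
}

const shipmentDelays = retryDelays({
  maximumAttempts: 5,
  initialIntervalSeconds: 10,
  backoffCoefficient: 2.0,
});
// shipmentDelays is [10, 20, 40, 80]: 150 seconds of waiting in the worst case.
```

So the shipping API gets five chances spread over roughly two and a half minutes of backoff, plus up to five minutes per attempt.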

💡 Signals and Queries

Signals allow sending events into a running workflow from outside (e.g., a delivery confirmation webhook). Queries allow reading the current workflow state without affecting execution. Both are powerful mechanisms for interacting with long-running workflows.
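
As a rough mental model, a signal is a write into workflow state that can wake up waiting code, and a query is a read that changes nothing. The sketch below models that with a plain class. OrderWorkflowSketch and its methods are hypothetical and skip everything a real engine does around persistence and delivery.

```typescript
// Hypothetical in-memory model of signal and query handlers.
class OrderWorkflowSketch {
  private deliveryConfirmed = false;
  private currentStatus = "awaiting-delivery";
  private waiting: Array<() => void> = [];

  // Signal handler: mutates state, then resumes anything blocked on it.
  confirmDelivery(): void {
    if (this.deliveryConfirmed) return; // duplicate signals are ignored
    this.deliveryConfirmed = true;
    this.currentStatus = "delivered";
    for (const resume of this.waiting) resume();
    this.waiting = [];
  }

  // Query handler: read-only, never changes workflow state.
  getStatus(): string {
    return this.currentStatus;
  }

  // Stand-in for a wait condition: run the continuation once delivery is confirmed.
  whenDelivered(resume: () => void): void {
    if (this.deliveryConfirmed) {
      resume();
    } else {
      this.waiting.push(resume);
    }
  }
}

const wf = new OrderWorkflowSketch();
const resumed: string[] = [];
wf.whenDelivered(() => {
  resumed.push("workflow-resumed");
});

const before = wf.getStatus(); // "awaiting-delivery"
wf.confirmDelivery(); // e.g. called from a delivery webhook
const after = wf.getStatus(); // "delivered"
```

In a real engine the signal is also written to the event history, so the state change survives a crash just like an activity result.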

Azure Durable Functions — Serverless Durable Execution

Azure Durable Functions is an extension of Azure Functions that provides durable execution in a serverless environment. It's the best fit if you're already in the Azure ecosystem and need simple-to-moderate workflow complexity.

// Orchestrator Function — equivalent to Temporal Workflow
[Function("OrderOrchestrator")]
public static async Task<OrderResult> RunOrchestrator(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var order = context.GetInput<Order>();

    // Each CallActivityAsync is equivalent to a Temporal Activity
    var payment = await context.CallActivityAsync<PaymentResult>(
        "VerifyPayment", order.PaymentInfo);

    if (!payment.Success)
        return OrderResult.Failed("Payment failed");

    await context.CallActivityAsync("ReserveInventory", order.Items);

    await context.CallActivityAsync("SendConfirmationEmail", order);

    var trackingId = await context.CallActivityAsync<string>(
        "CreateShipment", order);

    // Durable Timer — wait up to 14 days
    using var cts = new CancellationTokenSource();
    var deadline = context.CurrentUtcDateTime.AddDays(14);
    var timerTask = context.CreateTimer(deadline, cts.Token);

    // Wait for external event (similar to Signal in Temporal)
    var deliveryEvent = context.WaitForExternalEvent<bool>("DeliveryConfirmed");

    var winner = await Task.WhenAny(deliveryEvent, timerTask);
    if (winner == deliveryEvent)
    {
        cts.Cancel();
        return OrderResult.Completed(trackingId);
    }

    return OrderResult.DeliveryTimeout(trackingId);
}

// Activity Function
[Function("VerifyPayment")]
public static async Task<PaymentResult> VerifyPayment(
    [ActivityTrigger] PaymentInfo info,
    [FromServices] IPaymentGateway gateway)
{
    return await gateway.ChargeAsync(info);
}

Common Durable Functions Patterns

graph LR
    subgraph FC["Function Chaining"]
        A1["Activity A"] --> A2["Activity B"] --> A3["Activity C"]
    end

    subgraph FO["Fan-out / Fan-in"]
        B1["Start"] --> B2["Task 1"]
        B1 --> B3["Task 2"]
        B1 --> B4["Task 3"]
        B2 --> B5["Aggregate"]
        B3 --> B5
        B4 --> B5
    end

    subgraph MN["Monitor"]
        C1["Check"] --> C2{"Done?"}
        C2 -->|No| C3["Timer"] --> C1
        C2 -->|Yes| C4["Complete"]
    end

    style A1 fill:#e94560,stroke:#fff,color:#fff
    style A2 fill:#e94560,stroke:#fff,color:#fff
    style A3 fill:#e94560,stroke:#fff,color:#fff
    style B1 fill:#2c3e50,stroke:#fff,color:#fff
    style B5 fill:#2c3e50,stroke:#fff,color:#fff
    style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C4 fill:#4CAF50,stroke:#fff,color:#fff

Three core patterns: Function Chaining (sequential), Fan-out/Fan-in (parallel), Monitor (polling with timer)
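
Stripped of durability, the first two patterns are ordinary async composition. Here is a sketch with plain promises; runTask is a hypothetical stand-in for an activity call, and a real orchestrator would use the engine's own call API instead.

```typescript
// Hypothetical stand-in for an activity; returns a fake result size.
async function runTask(input: string): Promise<number> {
  return input.length;
}

// Function chaining: each step waits for the previous one.
async function chain(input: string): Promise<number> {
  const a = await runTask(input);
  const b = await runTask(`${input}-${a}`);
  return runTask(`${input}-${a}-${b}`);
}

// Fan-out / fan-in: start all tasks, then aggregate once every one finishes.
async function fanOutFanIn(inputs: string[]): Promise<number> {
  const tasks = inputs.map((input) => runTask(input)); // fan-out
  const results = await Promise.all(tasks); // fan-in
  return results.reduce((sum, n) => sum + n, 0);
}
```

For example, fanOutFanIn(["a.png", "bb.png"]) resolves to 11, the sum of the two fake sizes. What the durable engine adds is that each completed task is checkpointed, so a crash during the fan-in does not rerun finished work.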

Restate — Lightweight Durable Execution

Restate is an emerging engine focused on simplicity. Instead of requiring you to organize code into separate workflow/activity constructs, Restate lets you register ordinary functions as handlers and make individual steps durable by wrapping them in ctx.run().

// Restate SDK — TypeScript
import * as restate from "@restatedev/restate-sdk";

const orderService = restate.service({
  name: "OrderService",
  handlers: {
    // Handlers can be plain async functions that take the Restate context first
    processOrder: async (ctx: restate.Context, order: Order) => {
      // ctx.run() creates a "journal entry", similar to an activity
      const payment = await ctx.run("verify-payment", async () => {
        return await paymentGateway.charge(order.paymentInfo);
      });

      if (!payment.success) {
        return { status: "failed", reason: "payment" };
      }

      await ctx.run("reserve-inventory", async () => {
        await inventoryService.reserve(order.items);
      });

      await ctx.run("send-email", async () => {
        await emailService.sendConfirmation(order);
      });

      // Durable sleep: survives restarts
      await ctx.sleep(60_000); // wait 1 minute

      const tracking = await ctx.run("create-shipment", async () => {
        return await shippingService.create(order);
      });

      return { status: "completed", trackingId: tracking };
    },
  },
});

restate.endpoint().bind(orderService).listen(9080);

💡 What Makes Restate Different

Restate doesn't require running a complex separate cluster like Temporal. The Restate server is a lightweight single binary, and handlers are regular HTTP endpoints. It's especially well-suited for small teams that want durable execution without operating complex infrastructure.

Detailed Comparison of 3 Engines

| Criteria | Temporal | Azure Durable Functions | Restate |
| --- | --- | --- | --- |
| Hosting | Self-hosted or Temporal Cloud | Azure serverless (consumption plan) | Self-hosted (single binary) |
| Language SDKs | Go, Java, TypeScript, Python, .NET, PHP | C#, JavaScript, Python, Java, PowerShell | TypeScript, Java, Kotlin, Go, Python |
| Persistence | PostgreSQL, MySQL, Cassandra, SQLite | Azure Storage / MS SQL / Netherite | RocksDB (embedded) or external |
| Scalability | Millions of concurrent workflows | Auto-scale, limited by storage backend | Good for medium scale |
| Versioning | Worker Versioning (2026), Build ID | Task hub versioning, code conditions | Service versioning via deployment |
| Observability | Web UI, Metrics, built-in Tracing | Azure Monitor, Application Insights | OpenTelemetry, Admin API |
| Operational Complexity | Medium: requires cluster + DB | Low: fully managed | Low: single binary |
| Pricing | OSS free, Cloud pay-per-action | Pay-per-execution (serverless) | OSS free, Cloud option |
| Best For | Enterprise, high-throughput, multi-cloud | Azure-native, serverless workloads | Small teams, startups, lightweight needs |

When Do You Need Durable Execution?

Durable Execution is not a silver bullet. Here's a checklist to help you decide:

✅ Use it when

  • Long-running processes: Workflows spanning hours, days, or weeks (order processing, onboarding, subscription billing)
  • Complex saga patterns: Compensating transactions across multiple services with clear rollback logic
  • Human-in-the-loop: Workflows awaiting user approval (leave requests, expense reports)
  • Complex scheduled jobs: Stateful cron jobs needing retry, monitoring (data pipelines, report generation)
  • Reliable async operations: Calling unreliable external APIs with retry + timeout + fallback requirements

❌ Don't use it when

  • Simple request-response: APIs returning results in milliseconds — overhead isn't worth it
  • Stateless processing: Event processing without state tracking (log aggregation, simple ETL)
  • Strict real-time: Latency requirements < 10ms. Persisting every step to the event history adds overhead
  • Team not ready: The learning curve is steep, especially the deterministic constraint
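
Since the saga pattern in the checklist is easy to describe and harder to picture, here is a minimal sketch of compensation logic. SagaStep and runSaga are hypothetical names. In a durable engine each action and each compensation would be its own activity, so the rollback itself survives crashes.

```typescript
// Hypothetical saga runner: undo completed steps in reverse order on failure.
interface SagaStep {
  name: string;
  action: () => void;
  compensate: () => void;
}

function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      completed.push(step);
      log.push(`done:${step.name}`);
    } catch {
      // Roll back only what actually completed, newest first.
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`undo:${done.name}`);
      }
      return false;
    }
  }
  return true;
}

const sagaLog: string[] = [];
const succeeded = runSaga(
  [
    { name: "payment", action: () => {}, compensate: () => {} },
    { name: "inventory", action: () => {}, compensate: () => {} },
    {
      name: "shipment",
      action: () => {
        throw new Error("shipping API unavailable");
      },
      compensate: () => {},
    },
  ],
  sagaLog,
);
// sagaLog: done:payment, done:inventory, undo:inventory, undo:payment
```

The failed step is not compensated because it never completed. That is why each action should either fully succeed or leave nothing behind.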

Production Best Practices

1. Design Activities Correctly

// ❌ WRONG: Activity too large, doing too many things
[Activity]
public async Task ProcessEntireOrder(Order order)
{
    await VerifyPayment(order);
    await ReserveInventory(order);
    await SendEmail(order);       // If this fails, must retry EVERYTHING
    await CreateShipment(order);
}

// ✅ RIGHT: Each Activity is an independent retry unit
[Activity]
public async Task<PaymentResult> VerifyPayment(PaymentInfo info) { ... }

[Activity]
public async Task ReserveInventory(List<OrderItem> items) { ... }

[Activity]
public async Task SendEmail(Order order) { ... }

2. Ensure Activity Idempotency

Activities can be retried at any time, for example when a network timeout hides an execution that actually succeeded. Every activity must therefore be idempotent: running it multiple times has the same effect as running it once.

[Activity]
public async Task<PaymentResult> ChargePayment(string orderId, decimal amount)
{
    // Use idempotency key to prevent double-charging
    var idempotencyKey = $"order-payment-{orderId}";
    return await _paymentGateway.ChargeAsync(amount, idempotencyKey);
}
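
The snippet above relies on the payment gateway honoring the idempotency key. To show what that means on the receiving side, here is a sketch of a gateway that deduplicates by key. FakePaymentGateway is a hypothetical in-memory stand-in, not a real provider API.

```typescript
// Hypothetical in-memory gateway that deduplicates charges by idempotency key.
class FakePaymentGateway {
  private processed = new Map<string, string>();
  chargeCount = 0;
  totalChargedCents = 0;

  charge(amountCents: number, idempotencyKey: string): string {
    const existing = this.processed.get(idempotencyKey);
    if (existing !== undefined) {
      return existing; // retry: return the original result, charge nothing
    }
    this.chargeCount++;
    this.totalChargedCents += amountCents;
    const chargeId = `ch_${this.chargeCount}`;
    this.processed.set(idempotencyKey, chargeId);
    return chargeId;
  }
}

// The activity derives the key from the order, so every retry reuses it.
function chargePayment(gateway: FakePaymentGateway, orderId: string, amountCents: number): string {
  return gateway.charge(amountCents, `order-payment-${orderId}`);
}

const gateway = new FakePaymentGateway();
const firstCharge = chargePayment(gateway, "1001", 4999);
const retriedCharge = chargePayment(gateway, "1001", 4999); // simulated retry
// Both calls return the same charge id and the customer is charged once.
```

The key must come from stable workflow input such as the order ID. A key generated fresh inside the activity would be different on each retry and would defeat the purpose.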

3. Safe Workflow Versioning

graph TB
    subgraph V1["Version 1 (running)"]
        A["Step A"] --> B["Step B"] --> C["Step C"]
    end

    subgraph V2["Version 2 (newly deployed)"]
        D["Step A"] --> E["Step B"] --> F["Step B2 (new)"] --> G["Step C"]
    end

    subgraph Strategy["Strategy"]
        S1["New workflows → V2"]
        S2["Running workflows → V1 until completion"]
    end

    style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style D fill:#e94560,stroke:#fff,color:#fff
    style E fill:#e94560,stroke:#fff,color:#fff
    style F fill:#4CAF50,stroke:#fff,color:#fff
    style G fill:#e94560,stroke:#fff,color:#fff
    style S1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
    style S2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50

Worker Versioning: running workflows continue on old code, new workflows use new code
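
One way to picture this strategy is version pinning: each execution remembers the version it started on, and work is routed to matching code. The sketch below is a hypothetical model of that idea only. It does not use the actual versioning API of Temporal or any other engine.

```typescript
// Hypothetical model of pinning executions to the version they started on.
const workflowVersions = new Map<string, () => string[]>([
  ["v1", () => ["A", "B", "C"]],
  ["v2", () => ["A", "B", "B2", "C"]],
]);

const CURRENT_VERSION = "v2";

interface Execution {
  id: string;
  pinnedVersion: string;
}

// New executions are pinned to whatever version is current at start time.
function startWorkflow(id: string): Execution {
  return { id, pinnedVersion: CURRENT_VERSION };
}

// Replay and continuation always use the pinned definition.
function stepsFor(execution: Execution): string[] {
  const definition = workflowVersions.get(execution.pinnedVersion);
  if (definition === undefined) {
    throw new Error(`No worker available for version ${execution.pinnedVersion}`);
  }
  return definition();
}

const inFlight: Execution = { id: "order-1", pinnedVersion: "v1" }; // started before the deploy
const fresh = startWorkflow("order-2");
// stepsFor(inFlight) is A, B, C; stepsFor(fresh) is A, B, B2, C.
```

The practical consequence is that old code must stay deployed until the last execution pinned to it has finished.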

4. Testing Workflows

// Temporal's test environment runs workflows against a local test server, so no cluster is required
[Fact]
public async Task OrderWorkflow_SuccessfulPayment_CompletesOrder()
{
    await using var env = await WorkflowEnvironment.StartTimeSkippingAsync();
    using var worker = new TemporalWorker(env.Client, new TemporalWorkerOptions("test-queue")
        .AddWorkflow<OrderWorkflow>()
        .AddAllActivities(new OrderActivities(
            mockPayment.Object,
            mockInventory.Object,
            mockEmail.Object)));

    await worker.ExecuteAsync(async () =>
    {
        // Start the workflow without waiting for its result
        var handle = await env.Client.StartWorkflowAsync(
            (OrderWorkflow wf) => wf.RunAsync(testOrder),
            new WorkflowOptions { Id = "test-order-1", TaskQueue = "test-queue" });

        // Without this signal the workflow would end with a delivery timeout
        await handle.SignalAsync(wf => wf.ConfirmDeliveryAsync());

        var result = await handle.GetResultAsync();
        Assert.Equal("completed", result.Status);
    });
}

// Time-skipping environment: a 14-day wait
// elapses in milliseconds during testing

Conclusion

Durable Execution is transforming how we build distributed systems. Instead of spending most of your effort on infrastructure code (retry, state management, recovery), you focus on business logic and let the platform handle the rest.

Choose Temporal if you need large scale, multi-cloud, and your team has the resources to operate it. Choose Azure Durable Functions if you're already in the Azure ecosystem and prioritize serverless. Choose Restate if you're a small team, need speed, and don't yet need massive scale.

Whichever engine you choose, remember: Activities must be idempotent, Workflows must be deterministic, and always have a versioning strategy before going to production.
