Durable Execution: Building Crash-Proof Workflows in Distributed Systems
Posted on: 4/22/2026 11:10:40 PM
Table of contents
- The Problem with Traditional Workflows
- How It Works: Event History and Replay
- Temporal — Architecture and Real-World Code
- Azure Durable Functions — Serverless Durable Execution
- Restate — Lightweight Durable Execution
- Detailed Comparison of 3 Engines
- When Do You Need Durable Execution?
- Production Best Practices
- Conclusion
The Problem with Traditional Workflows
Imagine you're building an e-commerce order processing system. The flow includes: verify payment → deduct inventory → send confirmation email → call shipping API → update status. What happens if the server crashes in the middle of step 3?
With the traditional approach, you must manage state yourself: save state to a database after each step, write manual retry logic, handle idempotency, and build cron jobs to "sweep" stuck orders. Your 50-line business logic suddenly balloons into 500 lines of infrastructure code.
What is Durable Execution?
Durable Execution is a model that lets you write straightforward sequential code, while the platform guarantees that code will run to completion — even if servers crash, networks time out, or deployments happen mid-execution. State is automatically persisted and restored without the developer writing a single line of storage code.
How It Works: Event History and Replay
At the heart of Durable Execution lies the Event History — an immutable, append-only log recording every event in a workflow. When a worker crashes, the platform replays the event history on a new worker, reconstructing the entire state without re-executing side effects.
sequenceDiagram
participant W as Worker
participant S as Server/Scheduler
participant DB as Event Store
W->>S: Start Workflow
S->>DB: Write WorkflowStarted
W->>S: Activity: Verify payment ✓
S->>DB: Write ActivityCompleted(payment)
W->>S: Activity: Deduct inventory ✓
S->>DB: Write ActivityCompleted(inventory)
Note over W: 💥 Worker CRASH!
S-->>W: New worker assigned
S->>DB: Read Event History
DB-->>S: [Started, Payment✓, Inventory✓]
S-->>W: Replay → skip payment, skip inventory
W->>S: Activity: Send email (resume from step 3)
S->>DB: Write ActivityCompleted(email)
Replay mechanism: new worker reads event history, skips completed activities, resumes from the break point
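To make the replay idea concrete, here is a minimal in-memory sketch. It is not how any of the engines below is implemented; DurableContext, HistoryEvent, and orderWorkflow are hypothetical names, effects are synchronous to keep it short, and a plain array stands in for the event store.

```typescript
// Toy model of event-history replay; every name here is hypothetical.
type HistoryEvent = { step: string; result: string };

class DurableContext {
  private cursor = 0; // index of the next history event to match

  constructor(private readonly history: HistoryEvent[]) {}

  // Return the recorded result when one exists; otherwise run the effect and record it.
  run(step: string, effect: () => string): string {
    const recorded = this.history[this.cursor];
    if (recorded !== undefined) {
      if (recorded.step !== step) {
        throw new Error(`history mismatch: expected ${recorded.step}, got ${step}`);
      }
      this.cursor++;
      return recorded.result; // replayed: the effect is not executed again
    }
    const result = effect(); // first execution: perform the side effect
    this.history.push({ step, result }); // then append it to the log
    this.cursor++;
    return result;
  }
}

// Counters stand in for real side effects so re-execution is observable.
const calls = { payment: 0, inventory: 0, email: 0 };

function orderWorkflow(ctx: DurableContext, crashBeforeEmail: boolean): string {
  ctx.run("payment", () => { calls.payment++; return "paid"; });
  ctx.run("inventory", () => { calls.inventory++; return "reserved"; });
  if (crashBeforeEmail) throw new Error("simulated worker crash");
  ctx.run("email", () => { calls.email++; return "sent"; });
  return "done";
}

function demo(): string {
  const eventStore: HistoryEvent[] = []; // survives the simulated crash
  try {
    orderWorkflow(new DurableContext(eventStore), true);
  } catch {
    // The first worker died after two completed steps.
  }
  // A new worker re-runs the workflow from the top against the same history.
  return orderWorkflow(new DurableContext(eventStore), false);
}
```

After demo() returns, each counter in calls is 1: the second run re-entered the payment and inventory steps but received their recorded results instead of executing them again.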
The Deterministic Constraint
The most critical concept to understand: workflow code must be deterministic. During replay, the platform re-executes the workflow code from the beginning, but instead of actually running activities, it matches each call against the event history and returns the recorded result. If the code is non-deterministic (e.g., it uses DateTime.Now or Random directly), the replayed run can take a different path from the one recorded, the history no longer matches, and the workflow cannot resume correctly.
⚠️ What NOT to use in workflow code
Forbidden: DateTime.Now, Random, Thread.Sleep, direct API/DB calls, file I/O, mutable environment variables.
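Alternatives: Use the platform's deterministic APIs. In the Temporal .NET SDK these are Workflow.UtcNow, Workflow.Random, and Workflow.DelayAsync; in Durable Functions, context.CurrentUtcDateTime and context.CreateTimer. All side effects must live inside Activities.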
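The following sketch shows why the rule matters, using a toy replay checker rather than any real SDK. ReplayContext, badWorkflow, and goodWorkflow are hypothetical names; the "clock" is just a function returning the hour of day, which differs between the first run and the replay.

```typescript
// Toy replay checker; every name here is hypothetical.
type Recorded =
  | { kind: "time"; value: number }
  | { kind: "activity"; name: string };

class ReplayContext {
  private cursor = 0;

  constructor(
    private readonly history: Recorded[],
    private readonly clock: () => number,
  ) {}

  // Deterministic time: recorded on first execution, read back on replay.
  now(): number {
    const e = this.history[this.cursor];
    if (e !== undefined) {
      if (e.kind !== "time") throw new Error(`non-determinism: history has ${JSON.stringify(e)}`);
      this.cursor++;
      return e.value;
    }
    const value = this.clock();
    this.history.push({ kind: "time", value });
    this.cursor++;
    return value;
  }

  // Activities must be requested in the same order, with the same names, as recorded.
  activity(name: string): void {
    const e = this.history[this.cursor];
    if (e !== undefined) {
      if (e.kind !== "activity" || e.name !== name) {
        throw new Error(`non-determinism: history has ${JSON.stringify(e)}, code asked for ${name}`);
      }
      this.cursor++;
      return;
    }
    this.history.push({ kind: "activity", name });
    this.cursor++;
  }
}

// Bad: branches on the wall clock directly, so replay may pick another branch.
function badWorkflow(ctx: ReplayContext, wallClock: () => number): void {
  ctx.activity(wallClock() < 12 ? "morning-email" : "evening-email");
}

// Good: branches on a value that is itself part of the history.
function goodWorkflow(ctx: ReplayContext): void {
  ctx.activity(ctx.now() < 12 ? "morning-email" : "evening-email");
}

// First run happens at hour 9, replay at hour 15.
function survivesReplay(workflow: (ctx: ReplayContext, clock: () => number) => void): boolean {
  const history: Recorded[] = [];
  workflow(new ReplayContext(history, () => 9), () => 9);
  try {
    workflow(new ReplayContext(history, () => 15), () => 15);
    return true;
  } catch {
    return false;
  }
}
```

Here survivesReplay(badWorkflow) is false, because the replay asks for "evening-email" while the history recorded "morning-email"; survivesReplay(goodWorkflow) is true, because the hour is read back from the history.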
Temporal — Architecture and Real-World Code
Temporal is one of the most widely adopted durable execution engines today, used in production by Netflix, DoorDash, Stripe, and Snap. It's open-source (MIT license) with a managed cloud option.
Temporal Architecture
graph TB
subgraph Client["Client Application"]
A["Temporal Client<br/>SDK"]
end
subgraph TS["Temporal Server Cluster"]
F["Frontend Service<br/>API Gateway"]
H["History Service<br/>Event Storage + Replay"]
M["Matching Service<br/>Task Queue Dispatch"]
W2["Internal Worker"]
end
subgraph Workers["Worker Fleet"]
W1["Worker 1<br/>Workflow + Activity"]
W3["Worker 2<br/>Workflow + Activity"]
W4["Worker N<br/>Workflow + Activity"]
end
subgraph Storage["Persistence"]
DB2["Database<br/>PostgreSQL / MySQL / Cassandra"]
ES["Elasticsearch<br/>Visibility"]
end
A -->|"StartWorkflow<br/>Signal/Query"| F
F --> H
F --> M
H --> DB2
M -->|"Dispatch Task"| W1
M -->|"Dispatch Task"| W3
M -->|"Dispatch Task"| W4
H --> ES
style A fill:#e94560,stroke:#fff,color:#fff
style F fill:#2c3e50,stroke:#fff,color:#fff
style H fill:#2c3e50,stroke:#fff,color:#fff
style M fill:#2c3e50,stroke:#fff,color:#fff
style W1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style W3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style W4 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style DB2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style ES fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Temporal Server architecture: Frontend receives requests, History manages events, Matching dispatches tasks to Workers
Code Example: Order Processing Workflow
Here's an order processing workflow using the Temporal .NET SDK:
using Temporalio.Activities;
using Temporalio.Common;
using Temporalio.Workflows;

// Workflow definition
[Workflow]
public class OrderWorkflow
{
    private bool _deliveryConfirmed;
    private string _currentStatus = "Started";

    [WorkflowRun]
    public async Task<OrderResult> RunAsync(Order order)
    {
        // Step 1: Verify payment
        _currentStatus = "VerifyingPayment";
        var paymentResult = await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.VerifyPaymentAsync(order.PaymentInfo),
            new ActivityOptions { StartToCloseTimeout = TimeSpan.FromSeconds(30) });
        if (!paymentResult.Success)
            return OrderResult.Failed("Payment verification failed");

        // Step 2: Reserve inventory (at most 3 attempts)
        _currentStatus = "ReservingInventory";
        await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.ReserveInventoryAsync(order.Items),
            new ActivityOptions
            {
                StartToCloseTimeout = TimeSpan.FromMinutes(1),
                RetryPolicy = new RetryPolicy { MaximumAttempts = 3 }
            });

        // Step 3: Send confirmation email
        _currentStatus = "SendingConfirmation";
        await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.SendConfirmationEmailAsync(order),
            new ActivityOptions { StartToCloseTimeout = TimeSpan.FromSeconds(15) });

        // Step 4: Create shipment (each attempt must finish within 5 minutes;
        // failures are retried with exponential backoff)
        _currentStatus = "CreatingShipment";
        var trackingId = await Workflow.ExecuteActivityAsync(
            (OrderActivities act) => act.CreateShipmentAsync(order),
            new ActivityOptions
            {
                StartToCloseTimeout = TimeSpan.FromMinutes(5),
                RetryPolicy = new RetryPolicy
                {
                    MaximumAttempts = 5,
                    InitialInterval = TimeSpan.FromSeconds(10),
                    BackoffCoefficient = 2.0
                }
            });

        // Step 5: Wait for delivery confirmation (may take days)
        _currentStatus = "AwaitingDelivery";
        var delivered = await Workflow.WaitConditionAsync(
            () => _deliveryConfirmed,
            timeout: TimeSpan.FromDays(14));
        _currentStatus = delivered ? "Delivered" : "DeliveryTimeout";
        return delivered
            ? OrderResult.Completed(trackingId)
            : OrderResult.DeliveryTimeout(trackingId);
    }

    [WorkflowSignal]
    public async Task ConfirmDeliveryAsync()
    {
        _deliveryConfirmed = true;
    }

    [WorkflowQuery]
    public string GetStatus() => _currentStatus;
}

// Activity implementation: where side effects live
// ([Activity] applies to methods, not to the containing class)
public class OrderActivities
{
    private readonly IPaymentGateway _payment;
    private readonly IInventoryService _inventory;
    private readonly IEmailService _email;

    public OrderActivities(
        IPaymentGateway payment,
        IInventoryService inventory,
        IEmailService email)
    {
        _payment = payment;
        _inventory = inventory;
        _email = email;
    }

    [Activity]
    public async Task<PaymentResult> VerifyPaymentAsync(PaymentInfo info)
        => await _payment.ChargeAsync(info);

    [Activity]
    public async Task ReserveInventoryAsync(List<OrderItem> items)
        => await _inventory.ReserveAsync(items);

    [Activity]
    public async Task SendConfirmationEmailAsync(Order order)
        => await _email.SendOrderConfirmationAsync(order.CustomerEmail, order);

    [Activity]
    public async Task<string> CreateShipmentAsync(Order order)
        => await _inventory.CreateShipmentAsync(order.ShippingAddress, order.Items);
}
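As a rough illustration of what the shipment retry policy above means in practice, the sketch below computes the wait before each retry. It assumes the conventional exponential formula (initial interval × coefficient^n) and ignores jitter and any server-side details; the RetryPolicy interface and retryDelays function are hypothetical helpers, not part of any SDK.

```typescript
// Hypothetical mirror of the retry settings used in the workflow above.
interface RetryPolicy {
  maximumAttempts: number;      // total attempts, including the first one
  initialIntervalSec: number;   // wait before the first retry
  backoffCoefficient: number;   // multiplier applied after each retry
  maximumIntervalSec?: number;  // optional cap on any single wait
}

// Wait before retry n (0-based) is initial * coefficient^n, optionally capped.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const raw = policy.initialIntervalSec * Math.pow(policy.backoffCoefficient, retry);
    delays.push(
      policy.maximumIntervalSec !== undefined ? Math.min(raw, policy.maximumIntervalSec) : raw,
    );
  }
  return delays;
}

// Settings from the CreateShipment step: 5 attempts, 10 s initial interval, coefficient 2.0.
const shipmentDelays = retryDelays({
  maximumAttempts: 5,
  initialIntervalSec: 10,
  backoffCoefficient: 2.0,
});
// shipmentDelays is [10, 20, 40, 80]: four retries after the first attempt
```

Under these assumptions, a shipment API that stays down is given about two and a half minutes of waiting across its four retries before the activity fails for good.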
💡 Signals and Queries
Signals allow sending events into a running workflow from outside (e.g., a delivery confirmation webhook). Queries allow reading the current workflow state without affecting execution. Both are powerful mechanisms for interacting with long-running workflows.
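The interaction between a signal, a query, and a bounded wait can be sketched in a few lines. This is an in-memory toy, not durable and not a real SDK: DeliveryState and its methods are hypothetical names that mimic the shape of the signal handler, the query, and the wait-with-timeout in the workflow above.

```typescript
// Toy workflow state: a flag set by a "signal", plus waiters to wake when it changes.
class DeliveryState {
  private confirmed = false;
  private waiters: Array<() => void> = [];

  // "Signal": mutate state from outside and wake anyone waiting on it.
  confirmDelivery(): void {
    this.confirmed = true;
    this.waiters.forEach((wake) => wake());
    this.waiters = [];
  }

  // "Query": read state without changing it.
  getStatus(): string {
    return this.confirmed ? "delivered" : "in-transit";
  }

  // Resolves true once confirmed, or false if the timeout elapses first.
  waitForDelivery(timeoutMs: number): Promise<boolean> {
    if (this.confirmed) return Promise.resolve(true);
    return new Promise<boolean>((resolve) => {
      const timer = setTimeout(() => resolve(false), timeoutMs);
      this.waiters.push(() => {
        clearTimeout(timer);
        resolve(true);
      });
    });
  }
}
```

A webhook handler would call confirmDelivery(), a status page would call getStatus(), and the workflow itself would await waitForDelivery(...). In a real engine the flag, the timer, and the pending wait are all reconstructed from the event history after a crash, which this toy does not attempt.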
Azure Durable Functions — Serverless Durable Execution
Azure Durable Functions is an extension of Azure Functions that provides durable execution in a serverless environment. It's the best fit if you're already in the Azure ecosystem and need simple-to-moderate workflow complexity.
// Orchestrator Function: equivalent to a Temporal Workflow
[Function("OrderOrchestrator")]
public static async Task<OrderResult> RunOrchestrator(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var order = context.GetInput<Order>();

    // Each CallActivityAsync is equivalent to a Temporal Activity
    var payment = await context.CallActivityAsync<PaymentResult>(
        "VerifyPayment", order.PaymentInfo);
    if (!payment.Success)
        return OrderResult.Failed("Payment failed");

    await context.CallActivityAsync("ReserveInventory", order.Items);
    await context.CallActivityAsync("SendConfirmationEmail", order);
    var trackingId = await context.CallActivityAsync<string>(
        "CreateShipment", order);

    // Durable timer: wait up to 14 days
    using var cts = new CancellationTokenSource();
    var deadline = context.CurrentUtcDateTime.AddDays(14);
    var timerTask = context.CreateTimer(deadline, cts.Token);

    // Wait for external event (similar to Signal in Temporal)
    var deliveryEvent = context.WaitForExternalEvent<bool>("DeliveryConfirmed");
    var winner = await Task.WhenAny(deliveryEvent, timerTask);
    if (winner == deliveryEvent)
    {
        cts.Cancel();
        return OrderResult.Completed(trackingId);
    }
    return OrderResult.DeliveryTimeout(trackingId);
}

// Activity Function
[Function("VerifyPayment")]
public static async Task<PaymentResult> VerifyPayment(
    [ActivityTrigger] PaymentInfo info,
    [FromServices] IPaymentGateway gateway)
{
    return await gateway.ChargeAsync(info);
}
Common Durable Functions Patterns
graph LR
subgraph FC["Function Chaining"]
A1["Activity A"] --> A2["Activity B"] --> A3["Activity C"]
end
subgraph FO["Fan-out / Fan-in"]
B1["Start"] --> B2["Task 1"]
B1 --> B3["Task 2"]
B1 --> B4["Task 3"]
B2 --> B5["Aggregate"]
B3 --> B5
B4 --> B5
end
subgraph MN["Monitor"]
C1["Check"] --> C2{"Done?"}
C2 -->|No| C3["Timer"] --> C1
C2 -->|Yes| C4["Complete"]
end
style A1 fill:#e94560,stroke:#fff,color:#fff
style A2 fill:#e94560,stroke:#fff,color:#fff
style A3 fill:#e94560,stroke:#fff,color:#fff
style B1 fill:#2c3e50,stroke:#fff,color:#fff
style B5 fill:#2c3e50,stroke:#fff,color:#fff
style C1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C4 fill:#4CAF50,stroke:#fff,color:#fff
Three core patterns: Function Chaining (sequential), Fan-out/Fan-in (parallel), Monitor (polling with timer)
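The fan-out/fan-in shape is easy to see in plain async code. The sketch below is language-level only: it is not durable and uses no Durable Functions API, and fanOutFanIn and priceLine are hypothetical names. In an orchestrator the same structure would be built from activity calls awaited together.

```typescript
// Fan-out: start every task without awaiting; fan-in: await all, then aggregate.
async function fanOutFanIn<T, R>(
  inputs: T[],
  task: (input: T) => Promise<R>,
  aggregate: (results: R[]) => R,
): Promise<R> {
  const inFlight = inputs.map((input) => task(input)); // fan-out: all tasks start here
  const results = await Promise.all(inFlight);         // fan-in: wait for every task
  return aggregate(results);
}

// Hypothetical task: price a single order line.
async function priceLine(line: { qty: number; unitPrice: number }): Promise<number> {
  return line.qty * line.unitPrice;
}

// Price three lines in parallel and sum the subtotals.
const total = fanOutFanIn(
  [
    { qty: 2, unitPrice: 5 },
    { qty: 1, unitPrice: 20 },
    { qty: 3, unitPrice: 1 },
  ],
  priceLine,
  (subtotals) => subtotals.reduce((sum, s) => sum + s, 0),
);
// total resolves to 33 (10 + 20 + 3)
```

The key detail is that the tasks are created before any of them is awaited; awaiting inside the map callback would turn the pattern back into function chaining.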
Restate — Lightweight Durable Execution
Restate is an emerging engine focused on simplicity. Instead of requiring you to organize code into separate workflow/activity constructs, Restate lets you write handlers as ordinary functions and make individual side effects durable by wrapping them in ctx.run().
// Restate SDK (TypeScript)
// Order and the paymentGateway / inventoryService / emailService / shippingService
// clients are assumed to be defined elsewhere
import * as restate from "@restatedev/restate-sdk";

const orderService = restate.service({
  name: "OrderService",
  handlers: {
    // A handler is a plain async function that receives the Restate context
    processOrder: async (ctx: restate.Context, order: Order) => {
      // ctx.run() records a journal entry, similar to an activity
      const payment = await ctx.run("verify-payment", async () => {
        return await paymentGateway.charge(order.paymentInfo);
      });
      if (!payment.success) {
        return { status: "failed", reason: "payment" };
      }

      await ctx.run("reserve-inventory", async () => {
        await inventoryService.reserve(order.items);
      });
      await ctx.run("send-email", async () => {
        await emailService.sendConfirmation(order);
      });

      // Durable sleep: survives restarts
      await ctx.sleep(60_000); // wait 1 minute

      const tracking = await ctx.run("create-shipment", async () => {
        return await shippingService.create(order);
      });
      return { status: "completed", trackingId: tracking };
    },
  },
});

restate.endpoint().bind(orderService).listen(9080);
💡 What Makes Restate Different
Restate doesn't require running a complex separate cluster like Temporal. The Restate server is a lightweight single binary, and handlers are regular HTTP endpoints. It's especially well-suited for small teams that want durable execution without operating complex infrastructure.
Detailed Comparison of 3 Engines
| Criteria | Temporal | Azure Durable Functions | Restate |
|---|---|---|---|
| Hosting | Self-hosted or Temporal Cloud | Azure serverless (consumption plan) | Self-hosted (single binary) or Restate Cloud |
| Language SDKs | Go, Java, TypeScript, Python, .NET, PHP | C#, JavaScript, Python, Java, PowerShell | TypeScript, Java, Kotlin, Go, Python |
| Persistence | PostgreSQL, MySQL, Cassandra, SQLite | Azure Storage / MS SQL / Netherite | RocksDB (embedded) or external |
| Scalability | Millions of concurrent workflows | Auto-scale, limited by storage backend | Good for medium scale |
| Versioning | Worker Versioning (2026), Build ID | Task hub versioning, code conditions | Service versioning via deployment |
| Observability | Web UI, Metrics, built-in Tracing | Azure Monitor, Application Insights | OpenTelemetry, Admin API |
| Operational Complexity | Medium — requires cluster + DB | Low — fully managed | Low — single binary |
| Pricing | OSS free, Cloud pay-per-action | Pay-per-execution (serverless) | OSS free, Cloud option |
| Best For | Enterprise, high-throughput, multi-cloud | Azure-native, serverless workloads | Small teams, startups, lightweight needs |
When Do You Need Durable Execution?
Durable Execution is not a silver bullet. Here's a checklist to help you decide:
✅ Use it when
- Long-running processes: Workflows spanning hours, days, or weeks (order processing, onboarding, subscription billing)
- Complex saga patterns: Compensating transactions across multiple services with clear rollback logic
- Human-in-the-loop: Workflows awaiting user approval (leave requests, expense reports)
- Complex scheduled jobs: Stateful cron jobs needing retry, monitoring (data pipelines, report generation)
- Reliable async operations: Calling unreliable external APIs with retry + timeout + fallback requirements
❌ Don't use it when
- Simple request-response: APIs returning results in milliseconds — overhead isn't worth it
- Stateless processing: Event processing without state tracking (log aggregation, simple ETL)
- Strict real-time: Latency requirements < 10ms; persisting every step to the event history adds overhead
- Team not ready: The learning curve is steep, especially the deterministic constraint
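The saga item in the checklist above deserves a concrete shape. Below is a minimal, non-durable sketch of the compensation logic; SagaStep and runSaga are hypothetical names, and in a real engine each action and each compensation would be its own activity so that the rollback itself survives a crash.

```typescript
type SagaStep = { name: string; action: () => void; compensate: () => void };

// Run steps in order; on the first failure, undo completed steps in reverse order.
function runSaga(steps: SagaStep[], log: string[]): boolean {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      step.action();
      log.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      for (const done of completed.reverse()) {
        done.compensate();
        log.push(`undo:${done.name}`);
      }
      return false; // the saga was rolled back
    }
  }
  return true; // every step succeeded
}

// Hypothetical order saga in which the shipping step fails.
const sagaLog: string[] = [];
const sagaSucceeded = runSaga(
  [
    { name: "charge", action: () => {}, compensate: () => {} },
    { name: "reserve", action: () => {}, compensate: () => {} },
    {
      name: "ship",
      action: () => { throw new Error("carrier unavailable"); },
      compensate: () => {},
    },
  ],
  sagaLog,
);
// sagaLog: done:charge, done:reserve, undo:reserve, undo:charge
```

Note that the failed step is not compensated, only the steps that completed before it, and that compensations run newest-first so that dependencies unwind in the right order.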
Production Best Practices
1. Design Activities Correctly
// ❌ WRONG: Activity too large, doing too many things
[Activity]
public async Task ProcessEntireOrder(Order order)
{
    await VerifyPayment(order);
    await ReserveInventory(order);
    await SendEmail(order); // If this fails, must retry EVERYTHING
    await CreateShipment(order);
}

// ✅ RIGHT: Each Activity is an independent retry unit
[Activity]
public async Task<PaymentResult> VerifyPayment(PaymentInfo info) { ... }

[Activity]
public async Task ReserveInventory(List<OrderItem> items) { ... }

[Activity]
public async Task SendEmail(Order order) { ... }
2. Ensure Activity Idempotency
Activities can be retried at any time, for example when a network timeout hides the fact that the previous attempt actually succeeded. Every activity must therefore be idempotent: running it multiple times has the same effect as running it once.
[Activity]
public async Task<PaymentResult> ChargePayment(string orderId, decimal amount)
{
    // Use idempotency key to prevent double-charging
    var idempotencyKey = $"order-payment-{orderId}";
    return await _paymentGateway.ChargeAsync(amount, idempotencyKey);
}
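What the receiving side does with that key is what actually prevents the double charge. The sketch below assumes a gateway that remembers results by idempotency key; FakePaymentGateway and chargePayment are hypothetical stand-ins, not a real payment API.

```typescript
type Charge = { orderId: string; amount: number };

// Hypothetical gateway that deduplicates requests by idempotency key.
class FakePaymentGateway {
  readonly charges: Charge[] = []; // money actually taken
  private readonly seen = new Map<string, Charge>();

  charge(orderId: string, amount: number, idempotencyKey: string): Charge {
    const previous = this.seen.get(idempotencyKey);
    if (previous !== undefined) return previous; // retry: return the first result, charge nothing
    const charge: Charge = { orderId, amount };
    this.charges.push(charge);
    this.seen.set(idempotencyKey, charge);
    return charge;
  }
}

function chargePayment(gateway: FakePaymentGateway, orderId: string, amount: number): Charge {
  // The key is derived only from stable business data, so every retry produces the same key.
  return gateway.charge(orderId, amount, `order-payment-${orderId}`);
}

const gateway = new FakePaymentGateway();
const firstAttempt = chargePayment(gateway, "A-100", 49.9);
const retriedAttempt = chargePayment(gateway, "A-100", 49.9); // simulated activity retry
// gateway.charges.length is 1, and both calls return the same Charge object
```

A key built from a timestamp or a fresh random value would defeat the purpose, because each retry would look like a new request.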
3. Safe Workflow Versioning
graph TB
subgraph V1["Version 1 (running)"]
A["Step A"] --> B["Step B"] --> C["Step C"]
end
subgraph V2["Version 2 (newly deployed)"]
D["Step A"] --> E["Step B"] --> F["Step B2 (new)"] --> G["Step C"]
end
subgraph Strategy["Strategy"]
S1["New workflows → V2"]
S2["Running workflows → V1 until completion"]
end
style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style B fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style C fill:#f8f9fa,stroke:#e94560,color:#2c3e50
style D fill:#e94560,stroke:#fff,color:#fff
style E fill:#e94560,stroke:#fff,color:#fff
style F fill:#4CAF50,stroke:#fff,color:#fff
style G fill:#e94560,stroke:#fff,color:#fff
style S1 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
style S2 fill:#f8f9fa,stroke:#2c3e50,color:#2c3e50
Worker Versioning: running workflows continue on old code, new workflows use new code
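The routing rule in the diagram can be sketched as a small lookup. This is a toy model of the pinning idea only; builds, startWorkflow, and stepsFor are hypothetical names and do not correspond to any engine's actual versioning API.

```typescript
type WorkflowImpl = () => string[];

// Hypothetical registry: build ID -> workflow code deployed under that build.
const builds = new Map<string, WorkflowImpl>([
  ["v1", () => ["A", "B", "C"]],
  ["v2", () => ["A", "B", "B2", "C"]],
]);
let defaultBuild = "v1";

type Execution = { id: string; pinnedBuild: string };

// A new execution is pinned to whichever build is the default when it starts.
function startWorkflow(id: string): Execution {
  return { id, pinnedBuild: defaultBuild };
}

// Every later task, including replay, is routed to the pinned build.
function stepsFor(execution: Execution): string[] {
  const impl = builds.get(execution.pinnedBuild);
  if (impl === undefined) throw new Error(`no worker for build ${execution.pinnedBuild}`);
  return impl();
}

const oldOrder = startWorkflow("order-1"); // started before the deploy
defaultBuild = "v2";                       // deploy version 2
const newOrder = startWorkflow("order-2"); // started after the deploy
// stepsFor(oldOrder) is ["A", "B", "C"]; stepsFor(newOrder) is ["A", "B", "B2", "C"]
```

The practical consequence is that workers for the old build must keep running until the last execution pinned to it has completed; removing the "v1" entry early would make stepsFor(oldOrder) throw.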
4. Testing Workflows
// Temporal's test environment runs workflows against a local, time-skipping test server
// (mockPayment, mockInventory, mockEmail, and testOrder are assumed to be set up elsewhere)
[Fact]
public async Task OrderWorkflow_SuccessfulPayment_CompletesOrder()
{
    await using var env = await WorkflowEnvironment.StartTimeSkippingAsync();
    using var worker = new TemporalWorker(env.Client, new TemporalWorkerOptions("test-queue")
        .AddWorkflow<OrderWorkflow>()
        .AddAllActivities(new OrderActivities(
            mockPayment.Object,
            mockInventory.Object,
            mockEmail.Object)));

    await worker.ExecuteAsync(async () =>
    {
        var handle = await env.Client.StartWorkflowAsync(
            (OrderWorkflow wf) => wf.RunAsync(testOrder),
            new WorkflowOptions { Id = "test-order-1", TaskQueue = "test-queue" });

        // Without this signal the workflow waits out the 14-day timeout
        // and returns DeliveryTimeout instead of Completed
        await handle.SignalAsync(wf => wf.ConfirmDeliveryAsync());

        var result = await handle.GetResultAsync();
        Assert.Equal("completed", result.Status);
    });
}
// Time-skipping environment: the 14-day timeout in WaitConditionAsync
// elapses in milliseconds during testing
Conclusion
Durable Execution is transforming how we build distributed systems. Instead of spending most of your effort on infrastructure code (retry, state management, recovery), you concentrate on business logic and let the platform handle the rest.
Choose Temporal if you need large scale, multi-cloud, and your team has the resources to operate it. Choose Azure Durable Functions if you're already in the Azure ecosystem and prioritize serverless. Choose Restate if you're a small team, need speed, and don't yet need massive scale.
Whichever engine you choose, remember: Activities must be idempotent, Workflows must be deterministic, and always have a versioning strategy before going to production.