The Saga Pattern: Distributed Transactions Without 2PC
How to coordinate multi-step business workflows across .NET services. Choreography vs orchestration, MassTransit sagas, and compensation events.
The single hardest reliability problem in microservices is keeping data consistent across services without a distributed transaction. The saga pattern is the answer the industry settled on - a chain of local transactions plus compensating undo steps. This chapter shows how to implement sagas in .NET with MassTransit, when to choose choreography vs orchestration, and the failure modes that should drive that choice.
When does a saga become unavoidable?
Three signals.
A workflow crosses service boundaries. Order placement touches Inventory, Payment, and Shipping services - each owns its database. You cannot wrap them in one transaction; 2PC is operationally fragile and almost no managed services support it.
Partial failure is unacceptable. Charging the card without reserving inventory is a refund-and-apologise email; reserving inventory without charging is lost revenue. The system must end in "all done" or "all undone".
Steps are slow or asynchronous. Sending a fulfilment request to a third-party warehouse takes hours; a synchronous transaction is not even possible. Because each saga step commits its own local transaction, the steps can run hours apart without anything being held open in between.
If your workflow lives entirely inside one database, use a single EF Core transaction and skip this chapter.
What numbers and trade-offs should I budget for?
| Property | Choreography | Orchestration |
|---|---|---|
| Coupling | loose (events) | tighter (coordinator) |
| Visibility | spread across services | one log |
| Adding a step | change >= 2 services | change orchestrator |
| Debug complexity | high (event web) | low (state machine) |
| Team autonomy | high | moderate |
Most .NET teams should start with orchestration. The "single state machine you can read" beats the "decentralised event web" for mid-sized systems. Graduate to choreography when service teams are big enough that coordinating orchestrator changes becomes a bottleneck.
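To make the "decentralised event web" concrete, here is a minimal in-process sketch of choreography. It is an illustration only: `InMemoryBus`, `ChoreographedOrder`, and the event names are hypothetical stand-ins, not MassTransit APIs, and a real system would publish over a broker.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory bus: handlers subscribe by event name.
public sealed class InMemoryBus
{
    private readonly Dictionary<string, List<Action<Guid>>> _handlers = new();

    // Records every published event name, in order, so the flow can be inspected.
    public List<string> Log { get; } = new();

    public void Subscribe(string eventName, Action<Guid> handler)
    {
        if (!_handlers.TryGetValue(eventName, out var list))
            _handlers[eventName] = list = new List<Action<Guid>>();
        list.Add(handler);
    }

    public void Publish(string eventName, Guid orderId)
    {
        Log.Add(eventName);
        if (_handlers.TryGetValue(eventName, out var list))
            foreach (var handler in list)
                handler(orderId);
    }
}

public static class ChoreographedOrder
{
    // Each "service" knows only the event it reacts to and the event it emits.
    public static void Wire(InMemoryBus bus)
    {
        bus.Subscribe("OrderPlaced",       id => bus.Publish("InventoryReserved", id)); // Inventory
        bus.Subscribe("InventoryReserved", id => bus.Publish("PaymentCharged", id));    // Payment
        bus.Subscribe("PaymentCharged",    id => bus.Publish("OrderShipped", id));      // Shipping
    }
}
```

Nothing in this sketch states the workflow order in one place; it emerges from who subscribes to what. Inserting a step between payment and shipping means adding a new subscriber and changing the event Shipping listens to, which is the "change >= 2 services" row in the table.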
What does the saga shape look like?
Orchestration with explicit coordinator:
```mermaid
sequenceDiagram
    participant C as Client
    participant S as OrderSaga
    participant I as Inventory
    participant P as Payment
    participant Sh as Shipping
    C->>S: PlaceOrder
    S->>I: ReserveInventory
    I-->>S: Reserved
    S->>P: ChargeCard
    P-->>S: Charged
    S->>Sh: Ship
    Sh-->>S: Shipped
    S-->>C: OrderCompleted
```
If ChargeCard fails, the saga issues ReleaseInventory to compensate. If Ship fails, it issues both RefundCard and ReleaseInventory. The orchestrator owns the rules.
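The compensation rule can be sketched without any messaging infrastructure. The code below is a simplified, synchronous, in-process illustration, not how MassTransit runs a saga: `SagaStep` and `SagaRunner` are hypothetical names, and each step's failure is modelled as an exception. It shows only the ordering rule: undo what already completed, newest first.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical step type: a forward action plus the action that undoes it.
public sealed record SagaStep(string Name, Action Execute, Action Compensate);

public static class SagaRunner
{
    // Runs the steps in order. If a step throws, compensates the steps that
    // already completed, in reverse order, and returns false.
    public static bool Run(IReadOnlyList<SagaStep> steps, Action<string> log)
    {
        var completed = new Stack<SagaStep>();

        foreach (var step in steps)
        {
            try
            {
                step.Execute();
                completed.Push(step);
                log($"done: {step.Name}");
            }
            catch (Exception ex)
            {
                log($"failed: {step.Name} ({ex.Message})");

                // The failed step itself is not compensated: it never completed.
                while (completed.Count > 0)
                {
                    var undo = completed.Pop();
                    undo.Compensate();
                    log($"compensated: {undo.Name}");
                }
                return false;
            }
        }
        return true;
    }
}
```

With steps ReserveInventory, ChargeCard, and Ship, a failure in Ship runs the ChargeCard compensation (RefundCard) and then the ReserveInventory compensation (ReleaseInventory), matching the rule above. In a real saga each step is a message exchange and the list of completed steps is implied by the persisted state rather than held in memory.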
What is the .NET 10 wiring with MassTransit Saga State Machine?
```csharp
using System;
using MassTransit;

// Saga instance persisted between messages; CorrelationId identifies one order workflow.
public class OrderSagaState : SagaStateMachineInstance
{
    public Guid CorrelationId { get; set; }
    public string CurrentState { get; set; } = "";
    public Guid OrderId { get; set; }
    public Guid UserId { get; set; }
    public decimal Amount { get; set; }
    public Guid? ReservationId { get; set; }
    public Guid? ChargeId { get; set; }
}

public class OrderSaga : MassTransitStateMachine<OrderSagaState>
{
    public State Reserving { get; private set; } = null!;
    public State Charging { get; private set; } = null!;
    public State Shipping { get; private set; } = null!;
    public State Completed { get; private set; } = null!;
    public State Failed { get; private set; } = null!;

    public Event<PlaceOrder> Started { get; private set; } = null!;
    public Event<InventoryReserved> InventoryReserved { get; private set; } = null!;
    public Event<InventoryReservationFailed> InventoryFailed { get; private set; } = null!;
    public Event<PaymentCharged> PaymentCharged { get; private set; } = null!;
    public Event<PaymentChargeFailed> PaymentFailed { get; private set; } = null!;

    public OrderSaga()
    {
        InstanceState(x => x.CurrentState);

        // Assumption: every message carries OrderId, which doubles as the saga's
        // CorrelationId. Not needed if the contracts implement CorrelatedBy<Guid>.
        Event(() => Started, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => InventoryReserved, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => InventoryFailed, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => PaymentCharged, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => PaymentFailed, x => x.CorrelateById(ctx => ctx.Message.OrderId));

        Initially(
            When(Started)
                .Then(ctx =>
                {
                    ctx.Saga.OrderId = ctx.Message.OrderId;
                    ctx.Saga.Amount = ctx.Message.Amount;
                })
                .Publish(ctx => new ReserveInventory(ctx.Saga.OrderId, ctx.Message.Items))
                .TransitionTo(Reserving));

        During(Reserving,
            When(InventoryReserved)
                .Then(ctx => ctx.Saga.ReservationId = ctx.Message.ReservationId)
                .Publish(ctx => new ChargeCard(ctx.Saga.OrderId, ctx.Saga.Amount))
                .TransitionTo(Charging),
            When(InventoryFailed)
                .TransitionTo(Failed)); // nothing reserved yet, so nothing to compensate

        During(Charging,
            When(PaymentCharged)
                .Publish(ctx => new ShipOrder(ctx.Saga.OrderId))
                .TransitionTo(Shipping),
            When(PaymentFailed)
                .Publish(ctx => new ReleaseInventory(ctx.Saga.ReservationId!.Value)) // compensate
                .TransitionTo(Failed));

        // ... Shipping, Completed states omitted for brevity
    }
}
```
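The state machine refers to message types it does not define. One possible shape for those contracts is sketched below. Only the members the state machine actually reads or passes (OrderId, Amount, Items, ReservationId) come from the code above; `OrderItem`, the `Reason` fields, `ChargeId` on `PaymentCharged`, and the use of positional records are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Assumed line-item shape; the chapter does not define what Items contains.
public record OrderItem(string Sku, int Quantity);

// Command that starts the saga.
public record PlaceOrder(Guid OrderId, decimal Amount, IReadOnlyList<OrderItem> Items);

// Commands the saga publishes to the participating services.
public record ReserveInventory(Guid OrderId, IReadOnlyList<OrderItem> Items);
public record ChargeCard(Guid OrderId, decimal Amount);
public record ShipOrder(Guid OrderId);
public record ReleaseInventory(Guid ReservationId); // compensation for ReserveInventory

// Events the services send back. OrderId is included on each so the saga
// instance can be located; the Reason and ChargeId members are hypothetical.
public record InventoryReserved(Guid OrderId, Guid ReservationId);
public record InventoryReservationFailed(Guid OrderId, string Reason);
public record PaymentCharged(Guid OrderId, Guid ChargeId);
public record PaymentChargeFailed(Guid OrderId, string Reason);
```

Keeping OrderId on every reply is what lets the orchestrator route a message to the right saga instance without the participating services knowing anything about saga state.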
Three details. The state machine is the documentation of the workflow; new engineers read it and understand the system. Persistence to EF Core is a one-line registration; on crash, the orchestrator resumes from the last persisted state. Compensation is just another event publish - the same MassTransit infrastructure handles it.
What failure modes do sagas introduce?
- Stuck saga - one step fails repeatedly, compensation fails, saga sits in an intermediate state. Mitigation: alert on saga_age_seconds > 1 hour; manual recovery procedure.
- Compensation loops - compensation itself fails, triggering more compensation. Mitigation: cap retry attempts on compensation; if it fails, page a human.
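- Out-of-order events - InventoryReserved arrives after the saga has already given up. Mitigation: have the state machine ignore events that are not valid in the current state, and keep downstream handlers idempotent. In MassTransit this has to be configured explicitly (for example with Ignore(...) on the state); by default an event with no handler in the current state is treated as an error.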
- Saga state drift - the state DB and the participating services disagree. Mitigation: nightly reconciliation job that compares saga states with downstream service records.
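The first two mitigations can be sketched in plain C#. This is an illustration under assumptions, not part of MassTransit: `SagaSnapshot`, its `LastUpdated` field, and `SagaMitigations` are hypothetical, and the state names are taken from the state machine above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical view of a persisted saga row; LastUpdated is an assumed column.
public sealed record SagaSnapshot(Guid CorrelationId, string State, DateTimeOffset LastUpdated);

public static class SagaMitigations
{
    // Stuck-saga check: sagas that have sat in a non-terminal state longer than maxAge.
    public static IReadOnlyList<SagaSnapshot> FindStuck(
        IEnumerable<SagaSnapshot> sagas, DateTimeOffset now, TimeSpan maxAge) =>
        sagas.Where(s => s.State != "Completed" && s.State != "Failed")
             .Where(s => now - s.LastUpdated > maxAge)
             .ToList();

    // Bounded compensation: try at most maxAttempts times, then give up.
    // A false result means "stop retrying and page a human".
    public static async Task<bool> TryCompensateAsync(
        Func<Task> compensate, int maxAttempts, TimeSpan delayBetweenAttempts)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                await compensate();
                return true;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Attempts remain: wait, then loop again.
                await Task.Delay(delayBetweenAttempts);
            }
            catch (Exception)
            {
                return false; // out of attempts
            }
        }
        return false; // not reached when maxAttempts >= 1
    }
}
```

A scheduled job could call `FindStuck` with a one-hour `maxAge` to drive the alert. `TryCompensateAsync` keeps a failing compensation from retrying forever, which is what turns a compensation loop into a page.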
Chapter 13 emits saga state-transition events as OpenTelemetry traces - the entire workflow shows up as a single trace in Jaeger.
When is a saga overkill?
Three smells.
One: a single database write disguised as a workflow. If the "steps" all live in one DbContext, a transaction is the right shape, not a saga.
Two: a workflow with strict synchronous timing. If the user is waiting for the result and every step must complete in milliseconds, a saga's eventual consistency is the wrong model. A direct call chain (with resilience handlers from chapter 11) is simpler.
Three: when compensation is impossible. "Send notification" has no inverse - you cannot un-send. If the workflow contains non-compensatable steps, restructure so they run last (no later step can fail) or accept that those steps may already have run when the saga ends in a failed state.
Where should you go from here?
You have completed the reliability group. Next chapter: OpenTelemetry observability in .NET - the metrics, traces, and logs that let you see whether your sagas, breakers, and outboxes are actually working in production.