The Saga Pattern: Distributed Transactions Without 2PC

How to coordinate multi-step business workflows across .NET services. Choreography vs orchestration, MassTransit sagas, and compensation events.

Table of contents
  1. When does a saga become unavoidable?
  2. What numbers and trade-offs should I budget for?
  3. What does the saga shape look like?
  4. What is the .NET 10 wiring with MassTransit Saga State Machine?
  5. What failure modes do sagas introduce?
  6. When is a saga overkill?
  7. Where should you go from here?

The single hardest reliability problem in microservices is keeping data consistent across services without a distributed transaction. The saga pattern is the answer the industry settled on - a chain of local transactions plus compensating undo steps. This chapter shows how to implement sagas in .NET with MassTransit, when to choose choreography vs orchestration, and which failure modes should drive that choice.

When does a saga become unavoidable?

Three signals.

A workflow crosses service boundaries. Order placement touches Inventory, Payment, and Shipping services - each owns its database. You cannot wrap them in one transaction: two-phase commit (2PC) is operationally fragile, and almost no managed services support it.

Partial failure is unacceptable. Charging the card without reserving inventory is a refund-and-apologise email; reserving inventory without charging is lost revenue. The system must end in "all done" or "all undone".

Steps are slow or asynchronous. Sending a fulfilment request to a third-party warehouse can take hours; a synchronous transaction is not even possible. Because each saga step commits its own local transaction, nothing stays locked while the workflow waits for the next step.

If your workflow lives entirely inside one database, use a single EF Core transaction and skip this chapter.

What numbers and trade-offs should I budget for?

Property                  Choreography            Orchestration
Coupling                  loose (events)          tighter (coordinator)
Visibility                spread across services  one log
Adding a step             change >= 2 services    change orchestrator
Debug complexity          high (event web)        low (state machine)
Team autonomy             high                    moderate

Most .NET teams should start with orchestration. The "single state machine you can read" beats the "decentralised event web" for mid-sized systems. Graduate to choreography when service teams are big enough that coordinating orchestrator changes becomes a bottleneck.

What does the saga shape look like?

Orchestration with explicit coordinator:

sequenceDiagram
    participant C as Client
    participant S as OrderSaga
    participant I as Inventory
    participant P as Payment
    participant Sh as Shipping
    C->>S: PlaceOrder
    S->>I: ReserveInventory
    I-->>S: Reserved
    S->>P: ChargeCard
    P-->>S: Charged
    S->>Sh: Ship
    Sh-->>S: Shipped
    S-->>C: OrderCompleted

If ChargeCard fails, the saga issues ReleaseInventory to compensate. If Ship fails, it issues both RefundCard and ReleaseInventory. The orchestrator owns the rules.
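
The compensation rule above generalises: undo the steps that already committed, most recent first. The following is a minimal in-memory sketch of that rule, not the MassTransit implementation. SagaStep and SagaRunner are hypothetical names introduced for illustration, and a real orchestrator would publish compensation messages rather than invoke delegates.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical step type: a forward action plus the compensation that undoes it.
public sealed record SagaStep(string Name, Func<bool> Execute, Action Compensate);

public static class SagaRunner
{
    // Runs the steps in order. On the first failure, compensates every step
    // that already succeeded, most recent first, and returns false.
    public static bool Run(IEnumerable<SagaStep> steps, List<string> log)
    {
        var completed = new Stack<SagaStep>();

        foreach (var step in steps)
        {
            if (step.Execute())
            {
                log.Add(step.Name);
                completed.Push(step);
                continue;
            }

            log.Add($"{step.Name} failed");

            // The stack pops in reverse order of completion.
            while (completed.Count > 0)
            {
                var done = completed.Pop();
                done.Compensate();
                log.Add($"compensated {done.Name}");
            }

            return false;
        }

        return true;
    }
}
```

With the steps ReserveInventory, ChargeCard, and Ship, a failing Ship compensates ChargeCard first and ReserveInventory second, which corresponds to issuing RefundCard and then ReleaseInventory.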

What is the .NET 10 wiring with MassTransit Saga State Machine?

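using MassTransit;

// Saga instance persisted between messages; CorrelationId identifies one order workflow.
public class OrderSagaState : SagaStateMachineInstance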
{
    public Guid CorrelationId { get; set; }
    public string CurrentState { get; set; } = "";
    public Guid OrderId { get; set; }
    public Guid UserId { get; set; }
    public decimal Amount { get; set; }
    public Guid? ReservationId { get; set; }
    public Guid? ChargeId { get; set; }
}

public class OrderSaga : MassTransitStateMachine<OrderSagaState>
{
    public State Reserving { get; private set; } = null!;
    public State Charging { get; private set; } = null!;
    public State Shipping { get; private set; } = null!;
    public State Completed { get; private set; } = null!;
    public State Failed { get; private set; } = null!;

    public Event<PlaceOrder> Started { get; private set; } = null!;
    public Event<InventoryReserved> InventoryReserved { get; private set; } = null!;
    public Event<InventoryReservationFailed> InventoryFailed { get; private set; } = null!;
    public Event<PaymentCharged> PaymentCharged { get; private set; } = null!;
    public Event<PaymentChargeFailed> PaymentFailed { get; private set; } = null!;

    public OrderSaga()
    {
        InstanceState(x => x.CurrentState);

        // Correlate each event to a saga instance. This assumes every message
        // carries the OrderId; drop these lines if the contracts implement
        // CorrelatedBy<Guid> instead.
        Event(() => Started, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => InventoryReserved, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => InventoryFailed, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => PaymentCharged, x => x.CorrelateById(ctx => ctx.Message.OrderId));
        Event(() => PaymentFailed, x => x.CorrelateById(ctx => ctx.Message.OrderId));

        Initially(
            When(Started)
                .Then(ctx => { ctx.Saga.OrderId = ctx.Message.OrderId;
                                ctx.Saga.Amount = ctx.Message.Amount; })
                .Publish(ctx => new ReserveInventory(ctx.Saga.OrderId, ctx.Message.Items))
                .TransitionTo(Reserving));

        During(Reserving,
            When(InventoryReserved)
                .Then(ctx => ctx.Saga.ReservationId = ctx.Message.ReservationId)
                .Publish(ctx => new ChargeCard(ctx.Saga.OrderId, ctx.Saga.Amount))
                .TransitionTo(Charging),
            When(InventoryFailed)
                .TransitionTo(Failed));

        During(Charging,
            When(PaymentCharged)
                .Publish(ctx => new ShipOrder(ctx.Saga.OrderId))
                .TransitionTo(Shipping),
            When(PaymentFailed)
                .Publish(ctx => new ReleaseInventory(ctx.Saga.ReservationId!.Value))  // compensate
                .TransitionTo(Failed));

        // ... Shipping, Completed states omitted for brevity
    }
}

Three details. The state machine is the documentation of the workflow; new engineers read it and understand the system. Persistence to EF Core is a short registration; after a crash, the orchestrator resumes from the last persisted state when the next message arrives. Compensation is just another event publish - the same MassTransit infrastructure handles it.
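
For the persistence point, the registration might look like the sketch below. It assumes the MassTransit.EntityFrameworkCore and Microsoft.EntityFrameworkCore.SqlServer packages. OrderSagaStateMap, OrderSagaDbContext, and the AddOrderSaga helper are hypothetical names, SQL Server is an arbitrary choice, and the in-memory transport is a placeholder for whichever broker the system actually uses.

```csharp
using System.Collections.Generic;
using MassTransit;
using MassTransit.EntityFrameworkCoreIntegration;
using Microsoft.EntityFrameworkCore;
using Microsoft.EntityFrameworkCore.Metadata.Builders;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical mapping for the saga state table.
public class OrderSagaStateMap : SagaClassMap<OrderSagaState>
{
    protected override void Configure(EntityTypeBuilder<OrderSagaState> entity, ModelBuilder model)
    {
        entity.Property(x => x.CurrentState).HasMaxLength(64);
        entity.Property(x => x.Amount).HasPrecision(18, 2);
    }
}

// Hypothetical DbContext that holds only saga state.
public class OrderSagaDbContext : SagaDbContext
{
    public OrderSagaDbContext(DbContextOptions options) : base(options) { }

    protected override IEnumerable<ISagaClassMap> Configurations
    {
        get { yield return new OrderSagaStateMap(); }
    }
}

public static class OrderSagaRegistration
{
    // Hypothetical helper; call it once from the composition root.
    public static IServiceCollection AddOrderSaga(this IServiceCollection services, string connectionString)
    {
        services.AddMassTransit(x =>
        {
            x.AddSagaStateMachine<OrderSaga, OrderSagaState>()
                .EntityFrameworkRepository(r =>
                {
                    // Pessimistic locking avoids adding a row-version column;
                    // optimistic mode needs one on the saga state.
                    r.ConcurrencyMode = ConcurrencyMode.Pessimistic;
                    r.AddDbContext<DbContext, OrderSagaDbContext>((provider, options) =>
                        options.UseSqlServer(connectionString));
                });

            // Placeholder transport so the sketch is complete.
            x.UsingInMemory((context, cfg) => cfg.ConfigureEndpoints(context));
        });

        return services;
    }
}
```

The saga table still needs a migration before the first message arrives; that step is omitted here.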

What failure modes do sagas introduce?

Chapter 13 shows how to emit saga state transitions as OpenTelemetry spans, so the entire workflow shows up as a single trace in Jaeger.

When is a saga overkill?

Three smells.

One: a single database write disguised as a workflow. If the "steps" all live in one DbContext, a transaction is the right shape, not a saga.

Two: a workflow with strict synchronous timing. If the user is waiting for the result and every step must complete in milliseconds, a saga's eventual consistency is the wrong model. A direct call chain (with resilience handlers from chapter 11) is simpler.

Three: a workflow with steps that cannot be compensated. "Send notification" has no inverse - you cannot un-send. If the workflow contains non-compensatable steps, restructure so they run last (so no later step can fail after them) or accept that they may already have run when the saga ends in a failed state.

Where should you go from here?

You have completed the reliability group. Next chapter: OpenTelemetry observability in .NET - the metrics, traces, and logs that let you see whether your sagas, breakers, and outboxes are actually working in production.

Frequently asked questions

When does a saga earn its complexity?
When you have a business workflow that spans more than one service and needs all-or-nothing semantics, but two-phase commit (2PC) is not available. Examples: place order → reserve inventory → charge card → ship package. Each step lives in a different service; if charging fails, you must release inventory. The saga is the only practical answer at scale.
Choreography or orchestration?
Orchestration when the workflow is complex, has many steps, or changes often - one place to read the full sequence. Choreography when the services are owned by different teams and you want loose coupling - each service publishes events and subscribes to others. Most production .NET systems start orchestrated and graduate to choreography only when team scale forces it.
What is a compensating transaction?
The undo of a step. ChargeCard → RefundCard. ReserveInventory → ReleaseInventory. Compensation is not the database rollback - the original transaction has already committed. Compensation is a new business operation that brings the system to a consistent state. The semantics matter: a refund is not the same as never charging, and the user may see both.
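
To make the difference between compensation and rollback concrete, here is a small sketch of an append-only payment ledger. PaymentLedger and LedgerEntry are hypothetical types for illustration; a real payment service would call its provider and persist the entries.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record LedgerEntry(Guid OrderId, string Kind, decimal Amount);

// Hypothetical append-only ledger: nothing is ever deleted or rewritten.
public sealed class PaymentLedger
{
    private readonly List<LedgerEntry> _entries = new();

    public IReadOnlyList<LedgerEntry> Entries => _entries;

    public void Charge(Guid orderId, decimal amount) =>
        _entries.Add(new LedgerEntry(orderId, "Charge", amount));

    // Compensation appends a new entry; the original charge stays on record.
    public void Refund(Guid orderId, decimal amount) =>
        _entries.Add(new LedgerEntry(orderId, "Refund", -amount));

    public decimal Balance(Guid orderId) =>
        _entries.Where(e => e.OrderId == orderId).Sum(e => e.Amount);
}
```

After a charge and a refund of the same amount the balance is zero, yet the ledger holds two entries. That is the sense in which a refund differs from never charging.
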
How do I make sagas idempotent?
Each step's handler must be idempotent - covered in chapter 10. Each compensation must also be idempotent. The saga state machine itself should be persisted so a crashed orchestrator can resume from the last known step. MassTransit saga state machines persist to a saga repository (for example Entity Framework Core or Redis), so a restarted process reloads the saga state when the next message is delivered.
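
As an illustration of an idempotent step and an idempotent compensation, the sketch below keys both operations on the reservation id so that a redelivered message changes nothing. ReservationStore is a hypothetical in-memory stand-in; a real Inventory service would enforce the same rule with a unique key in its database.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory inventory for a single product.
public sealed class ReservationStore
{
    // reservationId -> reserved quantity
    private readonly Dictionary<Guid, int> _reserved = new();

    public int Available { get; private set; }

    public ReservationStore(int available) => Available = available;

    // Step: reserving twice with the same id takes stock only once.
    public bool Reserve(Guid reservationId, int quantity)
    {
        if (_reserved.ContainsKey(reservationId)) return true; // duplicate delivery
        if (quantity > Available) return false;

        Available -= quantity;
        _reserved[reservationId] = quantity;
        return true;
    }

    // Compensation: releasing twice with the same id returns stock only once.
    public void Release(Guid reservationId)
    {
        // Remove returns false when the reservation is already gone.
        if (_reserved.Remove(reservationId, out var quantity))
            Available += quantity;
    }
}
```

Reserving 3 units from a stock of 10 leaves 7 available however many times the message is delivered, and releasing the reservation restores the stock to 10 exactly once.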