Saga Pattern
A pattern for managing distributed transactions as a sequence of local transactions, each with a compensating transaction for rollback when a step fails.
The Alternative to 2PC
Two-phase commit (2PC) guarantees atomicity, but it introduces blocking failures and lock-holding latency that make it impractical for long-running or high-throughput distributed transactions. The Saga pattern accepts a different trade: instead of committing atomically across services, it decomposes the transaction into a sequence of independent local transactions. If a step fails, previously completed steps are reversed through compensating transactions. The system reaches consistency eventually rather than atomically.
How It Works: The E-commerce Example
An order placement involves three steps: reserve inventory, charge payment, confirm the order. Each step is a local transaction on a single service and database. If all three succeed, the order is placed. If the payment charge fails after inventory is reserved, the saga executes a compensating transaction on the inventory service to release the reservation. The compensating transactions undo the effects of completed steps in reverse order.
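The flow above can be sketched in a few lines. This is a minimal illustration, not a production design: `SagaStep`, `run_saga`, and the in-memory `log` are hypothetical names standing in for real service calls, and the payment step is hard-coded to fail so the compensation path is exercised.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # the local transaction
    compensation: Callable[[], None]  # undoes the action's effect


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
            return False
        completed.append(step)
    return True


# Hypothetical in-memory stand-in for the effects on the three services.
log: List[str] = []


def charge_payment() -> None:
    raise RuntimeError("card declined")  # simulated failure


order_saga = [
    SagaStep("reserve_inventory",
             lambda: log.append("inventory reserved"),
             lambda: log.append("inventory released")),
    SagaStep("charge_payment",
             charge_payment,
             lambda: log.append("payment refunded")),
    SagaStep("confirm_order",
             lambda: log.append("order confirmed"),
             lambda: log.append("order cancelled")),
]

succeeded = run_saga(order_saga)
print(succeeded, log)
```

Because the charge fails before it completes, only the inventory reservation is compensated; the refund and cancellation compensations never run.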
The key constraint on compensating transactions: they must be idempotent. If the message to release inventory is delivered twice (for example, because of a retry), the inventory service must apply it once and treat the duplicate as a no-op rather than releasing the stock twice. Idempotency keys, event deduplication, or conditional updates ensure this.
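One way to get this behavior is to key every reservation by an ID and make the release conditional on that reservation still existing. The sketch below assumes a hypothetical in-memory `InventoryService`; the class, its method names, and the `released` tombstone set are illustrative choices, not a prescribed API.

```python
from typing import Dict, Set, Tuple


class InventoryService:
    """Hypothetical in-memory inventory service with idempotent operations."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.available: Dict[str, int] = dict(stock)
        self.reservations: Dict[str, Tuple[str, int]] = {}  # id -> (sku, qty)
        self.released: Set[str] = set()  # tombstones for released ids

    def reserve(self, reservation_id: str, sku: str, quantity: int) -> None:
        # Duplicate or late reserve messages are ignored.
        if reservation_id in self.reservations or reservation_id in self.released:
            return
        if self.available.get(sku, 0) < quantity:
            raise ValueError("insufficient stock")
        self.available[sku] -= quantity
        self.reservations[reservation_id] = (sku, quantity)

    def release(self, reservation_id: str) -> None:
        # Conditional update: only a live reservation returns stock.
        self.released.add(reservation_id)
        reservation = self.reservations.pop(reservation_id, None)
        if reservation is None:
            return  # already released, or never reserved: nothing to do
        sku, quantity = reservation
        self.available[sku] += quantity


inventory = InventoryService({"sku-1": 5})
inventory.reserve("order-42", "sku-1", 2)
inventory.release("order-42")
inventory.release("order-42")  # duplicate delivery has no further effect
print(inventory.available["sku-1"])
```

The reservation ID doubles as the idempotency key here. In a real service the same check would typically be a conditional database update or a unique-key constraint rather than an in-memory dictionary.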
Choreography vs Orchestration
Sagas have two implementation styles:
Choreography: each service publishes an event when its local transaction completes. Downstream services listen for the event and trigger their own step. There is no central coordinator. The flow emerges from event subscriptions. This is loosely coupled and scalable, but the overall transaction flow is implicit and spread across services. Debugging a failed saga requires tracing events across multiple logs.
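A rough sketch of the choreographed style follows. The `EventBus` class is a synchronous, in-memory stand-in for a real message broker, and the event names and the `card_valid` field are assumptions made for the example. Note that no single function describes the whole flow; it only becomes visible in the recorded event history.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, object]
Handler = Callable[[Event], None]


class EventBus:
    """Synchronous in-memory stand-in for a message broker."""

    def __init__(self) -> None:
        self.handlers: DefaultDict[str, List[Handler]] = defaultdict(list)
        self.history: List[str] = []  # every event type, in publish order

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self.handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(event)


bus = EventBus()


# Inventory service: reacts to new orders and to failed payments.
def on_order_placed(event: Event) -> None:
    bus.publish("InventoryReserved", event)


def on_payment_failed(event: Event) -> None:
    bus.publish("InventoryReleased", event)  # compensating step


# Payment service: reacts to reserved inventory.
def on_inventory_reserved(event: Event) -> None:
    if event["card_valid"]:
        bus.publish("PaymentCharged", event)
    else:
        bus.publish("PaymentFailed", event)


# Order service: reacts to successful payments.
def on_payment_charged(event: Event) -> None:
    bus.publish("OrderConfirmed", event)


bus.subscribe("OrderPlaced", on_order_placed)
bus.subscribe("InventoryReserved", on_inventory_reserved)
bus.subscribe("PaymentCharged", on_payment_charged)
bus.subscribe("PaymentFailed", on_payment_failed)

bus.publish("OrderPlaced", {"order_id": "order-42", "card_valid": False})
print(bus.history)
```

With a declined card, the history shows the reservation followed by its compensation. Reconstructing that sequence in a real deployment means correlating events by order ID across each service's logs.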
Orchestration: a central Saga Orchestrator service explicitly calls each step in sequence. It maintains the saga state and decides what to do on success or failure. The flow is explicit and visible in one place. This is easier to reason about and monitor, but the orchestrator becomes a coupling point and a potential bottleneck.
Intermediate State Visibility
The fundamental difference from 2PC: intermediate states are visible. Between "inventory reserved" and "payment charged," the inventory shows a reservation that may or may not become a confirmed order. External systems observing the database during the saga see a partially completed transaction. Applications must be designed for this. Customers may briefly see "reserved" inventory that is later released. Business logic must accommodate these transient states.
Failure Handling
Compensating transactions are not always straightforward. Some operations are non-reversible: if a confirmation email was sent, you cannot un-send it. Sagas typically handle this with notifications rather than true reversal: a cancellation email replaces the undo. Compensation logic adds meaningful development overhead and must be tested against every failure permutation.
Saga State Persistence
A saga that spans multiple services and takes minutes to complete must persist its state. If the orchestrator crashes mid-saga, it must be able to resume from the last completed step without re-executing completed steps (which would double-charge a payment, for example). The orchestrator stores saga state in a durable store (a database table or event log), persisting each step transition before executing the next step. A step that was in flight at the moment of the crash may still be retried on resume, which is another reason steps should be idempotent. This is the outbox pattern applied to saga coordination: write the state transition and the outbound message in a single local transaction, then send the message only after that transaction has committed.
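The resume behavior can be sketched with SQLite from the Python standard library. Everything here is an assumption made for illustration: the `saga_state` and `outbox` table layouts, the `run_or_resume` function, and the simulated crash. The relay that would read the outbox and actually send the messages is omitted.

```python
import sqlite3
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS saga_state ("
                 "saga_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)")
    conn.execute("CREATE TABLE IF NOT EXISTS outbox ("
                 "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                 "saga_id TEXT NOT NULL, message TEXT NOT NULL)")
    conn.commit()


def run_or_resume(conn: sqlite3.Connection, saga_id: str, steps: List[Step]) -> None:
    row = conn.execute("SELECT next_step FROM saga_state WHERE saga_id = ?",
                       (saga_id,)).fetchone()
    start = row[0] if row else 0  # skip steps already recorded as done
    for index in range(start, len(steps)):
        name, action = steps[index]
        # If the process dies after action() but before the commit below,
        # this step runs again on resume, so it should be idempotent.
        action()
        with conn:  # one local transaction: state transition + outbox row
            conn.execute("INSERT OR REPLACE INTO saga_state (saga_id, next_step) "
                         "VALUES (?, ?)", (saga_id, index + 1))
            conn.execute("INSERT INTO outbox (saga_id, message) VALUES (?, ?)",
                         (saga_id, f"{name} completed"))


calls: List[str] = []
crash = {"armed": True}


def charge() -> None:
    if crash["armed"]:
        crash["armed"] = False
        raise RuntimeError("orchestrator crashed")  # simulated crash
    calls.append("charge")


steps: List[Step] = [
    ("reserve_inventory", lambda: calls.append("reserve")),
    ("charge_payment", charge),
    ("confirm_order", lambda: calls.append("confirm")),
]

conn = sqlite3.connect(":memory:")
init_store(conn)
try:
    run_or_resume(conn, "order-42", steps)
except RuntimeError:
    pass  # the first run dies during the payment step
run_or_resume(conn, "order-42", steps)  # resumes at charge_payment
print(calls)
```

On the second run the reservation is not repeated, because its transition was committed before the crash. Writing the outbox row in the same transaction as the state update means a message is never recorded for a transition that did not commit.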
When to Use Sagas vs 2PC
Use sagas when transactions span multiple services with independent databases, when long-running business processes (minutes to hours) make lock-holding intolerable, and when eventual consistency is acceptable to the business. Use 2PC when transactions are short, involve a small number of participants that support XA, and strict atomicity is a hard requirement. Most modern system designs sidestep both by designing around the distributed transaction problem: idempotent operations, eventual consistency, and domain model boundaries aligned with service boundaries so that transactions rarely cross service lines.
Interview Tip
The question that tests real understanding: "What happens if a compensating transaction itself fails?" This is not a trick question; it is a real operational problem. The answer requires a recovery strategy: compensating transactions must be retried (they must be idempotent), failed saga state must be persisted and monitored (a dead-letter queue or saga state machine with a failed terminal state), and operations may need manual intervention for truly un-compensatable failures. Candidates who answer "just retry" without addressing idempotency, and candidates who claim compensation always succeeds, are both signaling they have not built sagas in production. The L6 addition: explain that choreography-based sagas are harder to monitor and debug because the transaction flow is implicit, making the orchestration style preferable for high-stakes business processes where observability of the saga state machine is a requirement.
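The recovery strategy described above might look like the following sketch. The names `SagaStatus` and `compensate_with_retry` are hypothetical, the retry limit is arbitrary, and backoff between attempts is left out to keep the example short.

```python
from enum import Enum
from typing import Callable, List


class SagaStatus(Enum):
    COMPENSATED = "compensated"
    COMPENSATION_FAILED = "compensation_failed"  # terminal; needs an operator


def compensate_with_retry(compensations: List[Callable[[], None]],
                          max_attempts: int = 3) -> SagaStatus:
    """Retry each compensation; park the saga if one keeps failing."""
    for compensate in compensations:
        for _ in range(max_attempts):
            try:
                compensate()
                break  # this compensation succeeded
            except Exception:
                continue  # retrying is safe only if compensate is idempotent
        else:
            # Every attempt failed: stop and flag for manual intervention.
            return SagaStatus.COMPENSATION_FAILED
    return SagaStatus.COMPENSATED


attempts = {"count": 0}


def flaky_release() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("inventory service unavailable")


def broken_refund() -> None:
    raise ConnectionError("payment provider down")


recovered = compensate_with_retry([flaky_release])
parked = compensate_with_retry([broken_refund])
print(recovered, parked)
```

A transient failure is absorbed by the retries, while a persistent one ends in an explicit failed state that monitoring can alert on. That failed state should itself be persisted so it survives an orchestrator restart.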
Related Concepts
Message Queue: an asynchronous communication buffer between services. It decouples producers from consumers and provides durability during traffic spikes.
Event Sourcing: an architectural pattern where application state is derived by replaying an append-only log of events rather than storing and mutating current state directly.
Two-Phase Commit (2PC): a distributed coordination protocol that ensures all participants in a transaction either commit or abort atomically, using a prepare phase followed by a commit phase.