Distributed Consensus
The problem of getting distributed nodes to agree on a single value despite network failures and partial outages. The theoretical foundation behind etcd, ZooKeeper, and Kafka leader election.
The Problem
In a single-process system, "agree on a value" is trivial. In a distributed system, any node can fail mid-operation, messages can be delayed or reordered, and you cannot distinguish a slow node from a dead one. Distributed consensus is the formal problem of achieving agreement under these conditions without producing contradictions.
Why It's Hard: The Two Failure Modes That Matter
Split-brain occurs when a network partition causes two groups of nodes to each believe they're the majority and elect separate leaders. Both proceed to accept writes, creating two divergent histories that are difficult or impossible to reconcile. This is the failure mode that causes data corruption, not just data loss.
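The arithmetic behind preventing split-brain: any two majorities of the same cluster must share at least one node, so two leaders can never both assemble a quorum at the same time. A short exhaustive check of that overlap property (illustrative, not from any particular system):

```python
from itertools import combinations

# Any two majority quorums of an n-node cluster must overlap: each has
# n // 2 + 1 members, so two disjoint quorums would need more than n nodes.
for n in range(1, 8):
    q = n // 2 + 1                      # majority quorum size
    for a in combinations(range(n), q):
        for b in combinations(range(n), q):
            # A counterexample here would permit two simultaneous leaders.
            assert set(a) & set(b), "disjoint majorities found"
print("every pair of majorities intersects for n = 1..7")
```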
Livelock occurs when competing proposers repeatedly preempt each other's proposals and no value is ever chosen. Paxos is theoretically susceptible to this. In practice it's rare, but it shaped Raft's use of randomized election timeouts.
Paxos
Paxos is the original consensus algorithm, introduced and proven correct by Leslie Lamport (the paper circulated in 1989 and was published in 1998). It works in two phases: Prepare (a proposer obtains promises from a majority of acceptors to ignore lower-numbered proposals) and Accept (the proposer sends its value; acceptors accept it unless they have since promised a higher-numbered proposal). The correctness proof is tight. The implementation is notoriously difficult; Google's Chubby team documented this extensively. Multi-Paxos (Paxos for a continuous log, not a single value) is what systems actually use, but it requires significant additional engineering.
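The two phases can be sketched as a single-decree acceptor plus a minimal proposer round. Class and method names here are illustrative, not from any real implementation, and the driver omits the proposer's duty to re-propose the highest-numbered value it hears back in promises:

```python
# Single-decree Paxos acceptor (sketch; illustrative names).

class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised so far
        self.accepted_n = -1     # number of the proposal last accepted
        self.accepted_v = None   # value last accepted, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Report any previously accepted value: the proposer must
            # re-propose the highest-numbered one it hears back.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised, None)

    def accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher-numbered prepare won."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return "accepted"
        return "rejected"

# A proposer that reaches a majority of acceptors in both phases wins.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) > len(acceptors) // 2:
    outcomes = [a.accept(1, "value-X") for a in acceptors]
    print(outcomes)  # → ['accepted', 'accepted', 'accepted']
```

A later proposer running Prepare with a higher number would learn of `value-X` from the promises and be forced to carry it forward, which is how Paxos preserves agreement across competing proposers.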
Raft
Raft was designed explicitly for understandability. It decomposes consensus into three sub-problems: leader election, log replication, and safety. At most one node is leader at a time, and all writes go through it. Followers replicate the leader's log. If the leader fails, a new election occurs, and a node wins by collecting votes from a majority (quorum) of the cluster.
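The election rule can be made mechanistic: a follower grants at most one vote per term, and only to a candidate whose log is at least as up-to-date as its own. A sketch with illustrative names (the real protocol also has a follower that sees a higher term adopt it first and clear its recorded vote):

```python
# Sketch of Raft's vote-granting rule (illustrative, simplified).

def grant_vote(my_term, voted_for, my_last_log_term, my_last_log_index,
               cand_term, cand_id, cand_last_log_term, cand_last_log_index):
    if cand_term < my_term:
        return False                       # candidate is behind this term
    if voted_for not in (None, cand_id):
        return False                       # one vote per term
    # "At least as up-to-date": later last-entry term wins;
    # equal terms fall back to log length.
    if cand_last_log_term != my_last_log_term:
        return cand_last_log_term > my_last_log_term
    return cand_last_log_index >= my_last_log_index

# A candidate with a shorter log but a newer last-entry term still wins.
print(grant_vote(3, None, 2, 10, 3, "n2", 3, 4))  # → True
```

The up-to-date check is what prevents a stale follower from becoming leader and overwriting committed entries.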
The key invariant: a log entry is committed only when a majority of nodes have written it. This guarantees durability even if the leader crashes immediately after committing.
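One way to see the invariant in code: the leader tracks, for each node, the highest log index known to be replicated there (matchIndex in the Raft paper), and the commit index is the largest index present on a majority, which is the median of those values. A sketch (the real rule additionally requires the entry to carry the leader's current term):

```python
# Sketch: the leader's commit index is the highest index replicated on a
# majority, i.e. the median of the per-node matchIndex values.

def commit_index(match_index):
    """match_index[i] = highest log index known replicated on node i."""
    ranked = sorted(match_index, reverse=True)
    # The value at position n // 2 is present on at least n // 2 + 1 nodes.
    return ranked[len(match_index) // 2]

# 5-node cluster: leader and one follower at index 7, others at 5, 4, 2.
# Index 5 is on three of five nodes, so entries through 5 are committed.
print(commit_index([7, 7, 5, 4, 2]))  # → 5
```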
Raft is implemented in etcd (Kubernetes' config store), CockroachDB, TiKV, and Consul.
Quorum
For a cluster of n nodes, quorum requires floor(n/2) + 1 nodes to agree. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. This is why consensus clusters are almost always odd-numbered: adding a node to reach an even count raises the quorum size without raising fault tolerance (a 4-node cluster still tolerates only 1 failure, same as 3).
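The arithmetic, as a quick sketch:

```python
# Quorum size and fault tolerance for small clusters.

def quorum(n):
    return n // 2 + 1                  # floor(n/2) + 1

def tolerated_failures(n):
    return n - quorum(n)               # failures a quorum can survive

for n in range(2, 8):
    print(f"n={n}  quorum={quorum(n)}  tolerates={tolerated_failures(n)}")
# n=3 and n=4 both tolerate 1 failure; n=5 and n=6 both tolerate 2:
# the even-sized cluster pays for an extra node with no extra tolerance.
```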
Where It Shows Up in System Design
- Leader election in Kafka, Elasticsearch, and most distributed databases
- Distributed locks (etcd, ZooKeeper) for coordinating access to shared resources
- Config and service discovery (Consul, etcd)
- Strongly consistent reads in CockroachDB and Spanner
Interview Tip
Most candidates know Paxos exists but can't explain it. Interviewers at L6+ expect you to explain the quorum requirement and why split-brain is the critical failure mode to prevent, not just "nodes need to agree." Being able to describe Raft's leader election and log replication at a mechanistic level, and knowing where it's actually deployed, separates strong answers from surface-level ones.
Related Concepts
- Consistent hashing: a distributed hashing scheme that minimizes key remapping when nodes are added or removed.
- CAP theorem: a distributed system can only guarantee two of three properties: Consistency, Availability, and Partition Tolerance.
- Sharding: horizontal partitioning of a database across multiple machines to distribute load beyond a single server's capacity.