Distributed Consensus
The problem of getting distributed nodes to agree on a single value despite network failures and partial outages. The theoretical foundation behind etcd, ZooKeeper, and Kafka leader election.
The Problem
In a single-process system, "agree on a value" is trivial. In a distributed system, any node can fail mid-operation, messages can be delayed or reordered, and you cannot distinguish a slow node from a dead one. Distributed consensus is the formal problem of achieving agreement under these conditions without producing contradictions.
Why It's Hard: The Two Failure Modes That Matter
Split-brain occurs when a network partition causes two groups of nodes to each believe they're the majority and elect separate leaders. Both proceed to accept writes, creating two divergent histories that are difficult or impossible to reconcile. This is the failure mode that causes data corruption, not just data loss.
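The arithmetic behind preventing split-brain: any two majorities of the same cluster must share at least one node, so two leaders can never both assemble a quorum at the same time. A short exhaustive check of that overlap property (illustrative, not from any particular system):

```python
from itertools import combinations

# Any two majority quorums of an n-node cluster must overlap: each has
# n // 2 + 1 members, so two disjoint quorums would need more than n nodes.
for n in range(1, 8):
    q = n // 2 + 1                      # majority quorum size
    for a in combinations(range(n), q):
        for b in combinations(range(n), q):
            # A counterexample here would permit two simultaneous leaders.
            assert set(a) & set(b), "disjoint majorities found"
print("every pair of majorities intersects for n = 1..7")
```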
Livelock occurs when competing proposers repeatedly preempt each other's proposals and no value is ever chosen. Paxos is theoretically susceptible to this. In practice it's rare, but it shaped Raft's use of randomized election timeouts.
Paxos
Paxos is the original consensus algorithm, introduced and proven correct by Leslie Lamport (the paper circulated in 1989 and was published in 1998). It works in two phases: Prepare (a proposer obtains promises from a majority of acceptors to ignore lower-numbered proposals) and Accept (the proposer sends its value; acceptors accept it unless they have since promised a higher-numbered proposal). The correctness proof is tight. The implementation is notoriously difficult; Google's Chubby team documented this extensively. Multi-Paxos (Paxos for a continuous log, not a single value) is what systems actually use, but it requires significant additional engineering.
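The two phases can be sketched as a single-decree acceptor plus a minimal proposer round. Class and method names here are illustrative, not from any real implementation, and the driver omits the proposer's duty to re-propose the highest-numbered value it hears back in promises:

```python
# Single-decree Paxos acceptor (sketch; illustrative names).

class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised so far
        self.accepted_n = -1     # number of the proposal last accepted
        self.accepted_v = None   # value last accepted, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Report any previously accepted value: the proposer must
            # re-propose the highest-numbered one it hears back.
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised, None)

    def accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher-numbered prepare won."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return "accepted"
        return "rejected"

# A proposer that reaches a majority of acceptors in both phases wins.
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) > len(acceptors) // 2:
    outcomes = [a.accept(1, "value-X") for a in acceptors]
    print(outcomes)  # → ['accepted', 'accepted', 'accepted']
```

A later proposer running Prepare with a higher number would learn of `value-X` from the promises and be forced to carry it forward, which is how Paxos preserves agreement across competing proposers.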
Raft
Raft was designed explicitly for understandability. It decomposes consensus into three sub-problems: leader election, log replication, and safety. At most one node is leader at a time, and all writes go through it. Followers replicate the leader's log. If the leader fails, a new election occurs, and a node wins by collecting votes from a majority (quorum) of the cluster.
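The election rule can be made mechanistic: a follower grants at most one vote per term, and only to a candidate whose log is at least as up-to-date as its own. A sketch with illustrative names (the real protocol also has a follower that sees a higher term adopt it first and clear its recorded vote):

```python
# Sketch of Raft's vote-granting rule (illustrative, simplified).

def grant_vote(my_term, voted_for, my_last_log_term, my_last_log_index,
               cand_term, cand_id, cand_last_log_term, cand_last_log_index):
    if cand_term < my_term:
        return False                       # candidate is behind this term
    if voted_for not in (None, cand_id):
        return False                       # one vote per term
    # "At least as up-to-date": later last-entry term wins;
    # equal terms fall back to log length.
    if cand_last_log_term != my_last_log_term:
        return cand_last_log_term > my_last_log_term
    return cand_last_log_index >= my_last_log_index

# A candidate with a shorter log but a newer last-entry term still wins.
print(grant_vote(3, None, 2, 10, 3, "n2", 3, 4))  # → True
```

The up-to-date check is what prevents a stale follower from becoming leader and overwriting committed entries.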
The key invariant: a log entry is committed only when a majority of nodes have written it. This guarantees durability even if the leader crashes immediately after committing.
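One way to see the invariant in code: the leader tracks, for each node, the highest log index known to be replicated there (matchIndex in the Raft paper), and the commit index is the largest index present on a majority, which is the median of those values. A sketch (the real rule additionally requires the entry to carry the leader's current term):

```python
# Sketch: the leader's commit index is the highest index replicated on a
# majority, i.e. the median of the per-node matchIndex values.

def commit_index(match_index):
    """match_index[i] = highest log index known replicated on node i."""
    ranked = sorted(match_index, reverse=True)
    # The value at position n // 2 is present on at least n // 2 + 1 nodes.
    return ranked[len(match_index) // 2]

# 5-node cluster: leader and one follower at index 7, others at 5, 4, 2.
# Index 5 is on three of five nodes, so entries through 5 are committed.
print(commit_index([7, 7, 5, 4, 2]))  # → 5
```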
Raft is implemented in etcd (Kubernetes' config store), CockroachDB, TiKV, and Consul.
Quorum
For a cluster of n nodes, quorum requires floor(n/2) + 1 nodes to agree. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. This is why consensus clusters are almost always odd-numbered: adding a node to reach an even count raises the quorum size without raising fault tolerance (a 4-node cluster still tolerates only 1 failure, same as 3).
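The arithmetic, as a quick sketch:

```python
# Quorum size and fault tolerance for small clusters.

def quorum(n):
    return n // 2 + 1                  # floor(n/2) + 1

def tolerated_failures(n):
    return n - quorum(n)               # failures a quorum can survive

for n in range(2, 8):
    print(f"n={n}  quorum={quorum(n)}  tolerates={tolerated_failures(n)}")
# n=3 and n=4 both tolerate 1 failure; n=5 and n=6 both tolerate 2:
# the even-sized cluster pays for an extra node with no extra tolerance.
```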
Where It Shows Up in System Design
- Leader election in Kafka, Elasticsearch, and most distributed databases
- Distributed locks (etcd, ZooKeeper) for coordinating access to shared resources
- Config and service discovery (Consul, etcd)
- Strongly consistent reads in CockroachDB and Spanner
Interview Tip
Most candidates know Paxos exists but can't explain it. Interviewers at L6+ expect you to explain the quorum requirement and why split-brain is the critical failure mode to prevent, not just "nodes need to agree." Being able to describe Raft's leader election and log replication at a mechanistic level, and knowing where it's actually deployed, separates strong answers from surface-level ones.
Related Concepts
- Consistent hashing: a distributed hashing scheme that minimizes key remapping when nodes are added or removed.
- CAP theorem: a distributed system can only guarantee two of three properties: Consistency, Availability, and Partition Tolerance.
- Sharding: horizontal partitioning of a database across multiple machines to distribute load beyond a single server's capacity.