The Lineage of Consensus Algorithms — From 2PC to Byzantine Fault Tolerance, a Visual Journey Through Distributed Consensus
Visualize 2PC, Paxos, Raft, and PBFT through interactive simulators while tracing the evolution and theoretical foundations of distributed consensus algorithms.
When designing distributed systems, the problem of "getting multiple nodes to agree on a single value" is inescapable. Database replication, cross-service transactions, blockchain ledger management—at the core of all of these lies the consensus problem.
In this article, we trace the history of consensus algorithms, using interactive simulators to experience the evolution from 2PC → Paxos → Raft → PBFT firsthand. The goal is to understand the why behind each algorithm.
The Lineage of Consensus Algorithms
Let's start with a bird's-eye view of when the major consensus algorithms appeared.
Starting with 2PC in 1978, through the formulation of the Byzantine Generals Problem in 1982, and on to Paxos, PBFT, and Raft. Notice the two distinct tracks: crash fault tolerance on the left, and Byzantine fault tolerance on the right.
Fault Models — Crash Faults vs Byzantine Faults
Before diving into consensus algorithms, let's clarify the fault models.
- Crash Fault: A node suddenly stops and never responds again. After recovery, it behaves correctly. It never lies.
- Byzantine Fault: A node can behave arbitrarily. It can send false messages, tamper with data, or simply ignore requests—anything goes.
Crash faults are a special case of Byzantine faults. An algorithm that tolerates Byzantine faults also tolerates crash faults, but requires more nodes and generates more messages.
| Fault Model | Required Nodes | Node Behavior |
|---|---|---|
| Crash Fault | 2f + 1 (tolerates f failures) | Only stops; never lies |
| Byzantine Fault | 3f + 1 (tolerates f failures) | Can behave arbitrarily |
The FLP Impossibility Theorem — Perfect Consensus Cannot Exist
In 1985, Fischer, Lynch, and Paterson proved a groundbreaking result.
FLP Impossibility Theorem: In an asynchronous model where even one node can crash, no deterministic algorithm can simultaneously guarantee Safety, Liveness, and Fault Tolerance.
This doesn't mean "consensus algorithms are impossible." Practical algorithms circumvent FLP through:
- Randomization: Introducing randomness in proposer election (probabilistic liveness guarantee)
- Partial Synchrony Model: Assuming messages eventually arrive (GST: Global Stabilization Time)
- Failure Detectors: Using imperfect failure detection via timeouts
Paxos and Raft always guarantee safety, and guarantee liveness "eventually"—that is, after sufficient time has passed.
2PC — The Simplest Agreement
Two-Phase Commit (2PC) is the foundational protocol for distributed transactions. A Coordinator obtains agreement from all participants through two phases: Prepare → Commit.
Problems with 2PC
2PC is simple but has critical weaknesses:
- Blocking: If the Coordinator crashes, participants that already voted Yes wait indefinitely for the Commit/Abort decision. Timing out with an unilateral Abort risks inconsistency with other participants.
- Zero Fault Tolerance: A single Coordinator crash halts the entire system. It has fault tolerance of f=0.
- Single Point of Failure: All authority is centralized in the Coordinator.
To address these problems, Paxos was born.
Paxos — The Theoretical Foundation of Consensus
Paxos, proposed by Leslie Lamport in 1989, became the theoretical foundation for crash-fault-tolerant algorithms. It's used in Google's Chubby lock service and Spanner.
Paxos's Three Roles
- Proposer: Proposes a value. Each proposal carries a unique proposal number.
- Acceptor: Decides whether to accept a proposal. Agreement is reached when a majority accepts the same value.
- Learner: Receives the agreed-upon value and applies it to the state machine. This article focuses on the interaction between Proposers and Acceptors.
Single-Decree Paxos Protocol
Paxos operates in two phases (similar to 2PC, but fundamentally different in that it relies on a majority).
Phase 1 — Prepare/Promise:
- Proposer selects a proposal number
nand sendsPrepare(n)to all Acceptors - If
nis the highest seen, an Acceptor replies withPromise(n, previously accepted value)
Phase 2 — Accept/Accepted:
- After receiving a majority of Promises, Proposer sends
Accept(n, v)- If any Promise includes a previously accepted value, it must choose the value with the highest proposal number
- Otherwise, it proposes its own value
- If the Acceptor hasn't made a Promise for a higher number, it replies
Accepted(n, v)
Try the simulator to walk through each phase.
Paxos Simulator
Step through Single-Decree Paxos with different scenarios.
Why Paxos Is Safe
The core of Paxos's safety lies in ordering by proposal number.
- Acceptors reject
Acceptfor any proposal number smaller than the highestPromisethey've made - Proposers respect previously accepted values included in Promises
These two constraints guarantee that once a majority has accepted a value, no future proposal can override it. This is the Paxos safety theorem.
Challenges with Paxos
Paxos is correct, but difficult to implement.
- Hard to Understand: Lamport's original paper "The Part-Time Parliament" used an unusual allegory about an ancient Greek parliament, confusing many researchers. He later rewrote it as "Paxos Made Simple".
- Non-Obvious Extension to Multi-Paxos: Single-Decree Paxos only decides one value. Extending it for practical log replication (Multi-Paxos) is not prescribed in the paper, leading to divergent implementations.
- Liveness Issues (Livelock): Two Proposers can compete with ever-increasing proposal numbers indefinitely, preventing agreement (see the "Proposer Conflict" scenario in the simulator).
Raft — Redesigned for Understandability
Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed with the explicit goal of being "as safe as Paxos, but far easier to understand."
Raft's Design Philosophy
Raft solved Paxos's problems through two approaches:
- Problem Decomposition: Clearly separating consensus into Leader Election, Log Replication, and Safety.
- Strong Leader: All writes go through the Leader, with one-directional replication from Leader → Followers.
Raft's State Transitions
Each node has three states: Follower, Candidate, and Leader.
Leader Election
- All nodes start as Followers
- A node whose election timeout expires becomes a Candidate and increments its
term - It votes for itself and sends
RequestVoteRPCs to other nodes - Upon receiving a majority of votes, it becomes Leader
- The Leader periodically sends heartbeats (empty
AppendEntries) to reset election timeouts
Election Restriction: A voting node only grants its vote if the candidate's log is at least as up-to-date as its own. This prevents nodes without committed entries from becoming Leader.
Log Replication
- The Leader appends client writes to its log
- Sends log entries to all Followers via
AppendEntriesRPC - Once a majority responds → commit (apply to state machine)
Leader Completeness Property: Once committed, an entry is guaranteed to appear in the log of every future Leader. This is the core of Raft's safety.
Raft Simulator
Visualize Raft's leader election, log replication, and fault tolerance.
Raft vs Paxos
| Aspect | Paxos | Raft |
|---|---|---|
| Leader role | Optimization | Required |
| Log flow | Any Proposer → Acceptor | Leader → Follower (unidirectional) |
| Understandability | Notoriously difficult | A primary design goal |
| Safety proof | Equivalent | Equivalent |
| Adoption | Chubby, Spanner | etcd, CockroachDB, TiKV |
PBFT — Byzantine Fault Tolerance
Paxos and Raft only handle crash faults. In environments where nodes can "lie," a more powerful protocol is needed.
In 1999, Miguel Castro and Barbara Liskov proposed PBFT (Practical Byzantine Fault Tolerance), a groundbreaking practical algorithm that tolerates Byzantine faults.
The PBFT Protocol
PBFT operates in three phases.
- Pre-prepare: The Primary (equivalent to a leader) assigns a sequence number to the request and sends it to all Replicas
- Prepare: Each node sends a Prepare message to all other nodes (all-to-all communication)
- Commit: Once a node collects Prepare messages, it sends a Commit message to all other nodes
The Prepare and Commit phases each require collecting messages, ensuring that honest nodes maintain a majority even with Byzantine nodes.
Why 3f+1 Is Required
With total nodes and Byzantine nodes, there are honest nodes. To reach the threshold using only honest nodes:
With 4 nodes, the system tolerates ; with 7 nodes, Byzantine failures.
PBFT Simulator
Visualize Practical Byzantine Fault Tolerance with honest and Byzantine nodes.
Challenges with PBFT
- Message Complexity: All-to-all communication means message volume grows quadratically with node count
- Scalability: Practical limit is on the order of tens of nodes
- View Change Complexity: The protocol for replacing a faulty Primary is complex
Algorithm Comparison
| Algorithm | Fault Model | Tolerance | Msg Complexity | Latency | Use Case |
|---|---|---|---|---|---|
| 2PC | Crash | 0 | O(n) | 2 rounds | Databases |
| Paxos | Crash | f < n/2 | O(n) | 2 rounds | Chubby, Spanner |
| Raft | Crash | f < n/2 | O(n) | 2 rounds | etcd, CockroachDB |
| PBFT | Byzantine | f < n/3 | O(n²) | 3 rounds | Blockchain |
Real-World Adoption
Examining where each algorithm is used reveals the relationship between theory and practice.
Paxos Family
- Google Chubby: Distributed lock service using Multi-Paxos.
- Google Spanner: Globally distributed database using Paxos for replication.
- Apache ZooKeeper: Uses Zab protocol (a Paxos variant).
Raft Family
- etcd: Kubernetes backend store. Uses Raft for consistency.
- CockroachDB: Distributed SQL database. Each Range is a Raft group.
- TiKV: Distributed KV store (storage layer of TiDB). Uses Raft.
- Consul: Service mesh control plane. Uses Raft for consensus.
PBFT Family
- Hyperledger Fabric: Enterprise blockchain. Early versions used PBFT; since v1.4.1 uses Raft, and since v3.0 uses SmartBFT (based on BFT-SMART).
- Tendermint: Blockchain consensus engine. A PBFT variant.
Summary
Looking back at the evolution of consensus algorithms, several important patterns emerge.
- Expanding Fault Models: From crash faults → Byzantine faults. The stronger the faults to tolerate, the more nodes and messages are required.
- The Importance of Understandability: Paxos's difficulty gave rise to Raft. Correctness alone isn't enough—ease of understanding and implementation is a critical design metric.
- The Gap Between Theory and Practice: The FLP impossibility theorem says "perfection is impossible," but real systems work well enough using timeouts and randomization.
- Choosing Tradeoffs: Always guarantee safety; guarantee liveness "eventually"—this is the common strategy of modern consensus algorithms.
Consensus in distributed systems appears simple on the surface, but it's a deep field with over 40 years of accumulated research. If the simulators in this article helped convey the why behind each algorithm, even a little, that's a success.
References
- Lamport, L. "The Part-Time Parliament". ACM TOCS, 1998.
- Lamport, L. "Paxos Made Simple". 2001.
- Ongaro, D. and Ousterhout, J. "In Search of an Understandable Consensus Algorithm". USENIX ATC, 2014.
- Castro, M. and Liskov, B. "Practical Byzantine Fault Tolerance". OSDI, 1999.
- Fischer, M., Lynch, N., and Paterson, M. "Impossibility of Distributed Consensus with One Faulty Process". JACM, 1985.
- Lamport, L., Shostak, R., and Pease, M. "The Byzantine Generals Problem". ACM TOPLAS, 1982.