The Lineage of Consensus Algorithms

When designing distributed systems, the problem of "getting multiple nodes to agree on a single value" is inescapable. Database replication, cross-service transactions, blockchain ledger management—at the core of all of these lies the consensus problem.

In this article, we trace the history of consensus algorithms, using interactive simulators to experience the evolution from 2PC → Paxos → Raft → PBFT firsthand. The goal is to understand the why behind each algorithm.

Let's start with a bird's-eye view of when the major consensus algorithms appeared.

Starting with 2PC in 1978, through the formulation of the Byzantine Generals Problem in 1982, and on to Paxos, PBFT, and Raft. Notice the two distinct tracks: crash fault tolerance on the left, and Byzantine fault tolerance on the right.

Fault Models — Crash Faults vs Byzantine Faults

Before diving into consensus algorithms, let's clarify the fault models.

Crash Fault: A node suddenly stops and never responds again. After recovery, it behaves correctly. It never lies.
Byzantine Fault: A node can behave arbitrarily. It can send false messages, tamper with data, or simply ignore requests—anything goes.

Crash faults are a special case of Byzantine faults. An algorithm that tolerates Byzantine faults also tolerates crash faults, but requires more nodes and generates more messages.

Fault Model	Required Nodes	Node Behavior
Crash Fault	2f + 1 (tolerates f failures)	Only stops; never lies
Byzantine Fault	3f + 1 (tolerates f failures)	Can behave arbitrarily

The FLP Impossibility Theorem — Perfect Consensus Cannot Exist

In 1985, Fischer, Lynch, and Paterson proved a groundbreaking result.

FLP Impossibility Theorem: In an asynchronous model where even one node can crash, no deterministic algorithm can simultaneously guarantee Safety, Liveness, and Fault Tolerance.

This doesn't mean "consensus algorithms are impossible." Practical algorithms circumvent FLP through:

Randomization: Introducing randomness in proposer election (probabilistic liveness guarantee)
Partial Synchrony Model: Assuming messages eventually arrive (GST: Global Stabilization Time)
Failure Detectors: Using imperfect failure detection via timeouts

Paxos and Raft always guarantee safety, and guarantee liveness "eventually"—that is, after sufficient time has passed.

2PC — The Simplest Agreement

Two-Phase Commit (2PC) is the foundational protocol for distributed transactions. A Coordinator obtains agreement from all participants through two phases: Prepare → Commit.

Problems with 2PC

2PC is simple but has critical weaknesses:

Blocking: If the Coordinator crashes, participants that already voted Yes wait indefinitely for the Commit/Abort decision. Timing out with an unilateral Abort risks inconsistency with other participants.
Zero Fault Tolerance: A single Coordinator crash halts the entire system. It has fault tolerance of f=0.
Single Point of Failure: All authority is centralized in the Coordinator.

To address these problems, Paxos was born.

Paxos — The Theoretical Foundation of Consensus

Paxos, proposed by Leslie Lamport in 1989, became the theoretical foundation for crash-fault-tolerant algorithms. It's used in Google's Chubby lock service and Spanner.

Paxos's Three Roles

Proposer: Proposes a value. Each proposal carries a unique proposal number.
Acceptor: Decides whether to accept a proposal. Agreement is reached when a majority accepts the same value.
Learner: Receives the agreed-upon value and applies it to the state machine. This article focuses on the interaction between Proposers and Acceptors.

Single-Decree Paxos Protocol

Paxos operates in two phases (similar to 2PC, but fundamentally different in that it relies on a majority).

Phase 1 — Prepare/Promise:

Proposer selects a proposal number n and sends Prepare(n) to all Acceptors
If n is the highest seen, an Acceptor replies with Promise(n, previously accepted value)

Phase 2 — Accept/Accepted:

After receiving a majority of Promises, Proposer sends Accept(n, v)
- If any Promise includes a previously accepted value, it must choose the value with the highest proposal number
- Otherwise, it proposes its own value
If the Acceptor hasn't made a Promise for a higher number, it replies Accepted(n, v)

Try the simulator to walk through each phase.

Paxos Simulator

Step through Single-Decree Paxos with different scenarios.

Scenario:

PreparePromiseAcceptAccepted

Initial state: N1 is the Proposer, N2–N5 are Acceptors. N1 starts consensus with proposal number n=1.

[1] N1 begins consensus as Proposer

Step 1 / 51 / 5

Why Paxos Is Safe

The core of Paxos's safety lies in ordering by proposal number.

Acceptors reject Accept for any proposal number smaller than the highest Promise they've made
Proposers respect previously accepted values included in Promises

These two constraints guarantee that once a majority has accepted a value, no future proposal can override it. This is the Paxos safety theorem.

Challenges with Paxos

Paxos is correct, but difficult to implement.

Hard to Understand: Lamport's original paper "The Part-Time Parliament" used an unusual allegory about an ancient Greek parliament, confusing many researchers. He later rewrote it as "Paxos Made Simple".
Non-Obvious Extension to Multi-Paxos: Single-Decree Paxos only decides one value. Extending it for practical log replication (Multi-Paxos) is not prescribed in the paper, leading to divergent implementations.
Liveness Issues (Livelock): Two Proposers can compete with ever-increasing proposal numbers indefinitely, preventing agreement (see the "Proposer Conflict" scenario in the simulator).

Raft — Redesigned for Understandability

Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed with the explicit goal of being "as safe as Paxos, but far easier to understand."

Raft's Design Philosophy

Raft solved Paxos's problems through two approaches:

Problem Decomposition: Clearly separating consensus into Leader Election, Log Replication, and Safety.
Strong Leader: All writes go through the Leader, with one-directional replication from Leader → Followers.

Raft's State Transitions

Each node has three states: Follower, Candidate, and Leader.

Leader Election

All nodes start as Followers
A node whose election timeout expires becomes a Candidate and increments its term
It votes for itself and sends RequestVote RPCs to other nodes
Upon receiving a majority of votes, it becomes Leader
The Leader periodically sends heartbeats (empty AppendEntries) to reset election timeouts

Election Restriction: A voting node only grants its vote if the candidate's log is at least as up-to-date as its own. This prevents nodes without committed entries from becoming Leader.

Log Replication

The Leader appends client writes to its log
Sends log entries to all Followers via AppendEntries RPC
Once a majority responds → commit (apply to state machine)

Leader Completeness Property: Once committed, an entry is guaranteed to appear in the log of every future Leader. This is the core of Raft's safety.

Raft Simulator

Visualize Raft's leader election, log replication, and fault tolerance.

Scenario:

Replicated Log

empty

RequestVoteVoteAppendEntries

Initial: All nodes are Followers at term=0. No Leader yet. Each node has an election timeout; when it expires, the node becomes a Candidate.

[1] All nodes start as Followers

Step 1 / 51 / 5

Raft vs Paxos

Aspect	Paxos	Raft
Leader role	Optimization	Required
Log flow	Any Proposer → Acceptor	Leader → Follower (unidirectional)
Understandability	Notoriously difficult	A primary design goal
Safety proof	Equivalent	Equivalent
Adoption	Chubby, Spanner	etcd, CockroachDB, TiKV

PBFT — Byzantine Fault Tolerance

Paxos and Raft only handle crash faults. In environments where nodes can "lie," a more powerful protocol is needed.

In 1999, Miguel Castro and Barbara Liskov proposed PBFT (Practical Byzantine Fault Tolerance), a groundbreaking practical algorithm that tolerates Byzantine faults.

The PBFT Protocol

PBFT operates in three phases.

Pre-prepare: The Primary (equivalent to a leader) assigns a sequence number to the request and sends it to all Replicas
Prepare: Each node sends a Prepare message to all other nodes (all-to-all communication)
Commit: Once a node collects $2f+1$ Prepare messages, it sends a Commit message to all other nodes

The Prepare and Commit phases each require collecting $2f+1$ messages, ensuring that honest nodes maintain a majority even with $f$ Byzantine nodes.

Why 3f+1 Is Required

With $n$ total nodes and $f$ Byzantine nodes, there are $n - f$ honest nodes. To reach the $2f+1$ threshold using only honest nodes:

$n - f \geq 2f + 1 \implies n \geq 3f + 1$

With 4 nodes, the system tolerates $f=1$ ; with 7 nodes, $f=2$ Byzantine failures.

PBFT Simulator

Visualize Practical Byzantine Fault Tolerance with honest and Byzantine nodes.

Scenario:

Pre-preparePrepareCommitByzantine msg

PBFT Normal case: 4 nodes (n=3f+1, f=1). N0 is Primary, N1–N3 are Replicas. Client request arrives.

[1] 4-node PBFT (f=1): tolerates up to 1 Byzantine fault

Step 1 / 41 / 4

Challenges with PBFT

$O(n^2)$ Message Complexity: All-to-all communication means message volume grows quadratically with node count
Scalability: Practical limit is on the order of tens of nodes
View Change Complexity: The protocol for replacing a faulty Primary is complex

Algorithm Comparison

Algorithm	Fault Model	Tolerance	Msg Complexity	Latency	Use Case
2PC	Crash	0	O(n)	2 rounds	Databases
Paxos	Crash	f < n/2	O(n)	2 rounds	Chubby, Spanner
Raft	Crash	f < n/2	O(n)	2 rounds	etcd, CockroachDB
PBFT	Byzantine	f < n/3	O(n²)	3 rounds	Blockchain

Real-World Adoption

Examining where each algorithm is used reveals the relationship between theory and practice.

Paxos Family

Google Chubby: Distributed lock service using Multi-Paxos.
Google Spanner: Globally distributed database using Paxos for replication.
Apache ZooKeeper: Uses Zab protocol (a Paxos variant).

Raft Family

etcd: Kubernetes backend store. Uses Raft for consistency.
CockroachDB: Distributed SQL database. Each Range is a Raft group.
TiKV: Distributed KV store (storage layer of TiDB). Uses Raft.
Consul: Service mesh control plane. Uses Raft for consensus.

PBFT Family

Hyperledger Fabric: Enterprise blockchain. Early versions used PBFT; since v1.4.1 uses Raft, and since v3.0 uses SmartBFT (based on BFT-SMART).
Tendermint: Blockchain consensus engine. A PBFT variant.

Summary

Looking back at the evolution of consensus algorithms, several important patterns emerge.

Expanding Fault Models: From crash faults → Byzantine faults. The stronger the faults to tolerate, the more nodes and messages are required.
The Importance of Understandability: Paxos's difficulty gave rise to Raft. Correctness alone isn't enough—ease of understanding and implementation is a critical design metric.
The Gap Between Theory and Practice: The FLP impossibility theorem says "perfection is impossible," but real systems work well enough using timeouts and randomization.
Choosing Tradeoffs: Always guarantee safety; guarantee liveness "eventually"—this is the common strategy of modern consensus algorithms.

Consensus in distributed systems appears simple on the surface, but it's a deep field with over 40 years of accumulated research. If the simulators in this article helped convey the why behind each algorithm, even a little, that's a success.

References

Lamport, L. "The Part-Time Parliament". ACM TOCS, 1998.
Lamport, L. "Paxos Made Simple". 2001.
Ongaro, D. and Ousterhout, J. "In Search of an Understandable Consensus Algorithm". USENIX ATC, 2014.
Castro, M. and Liskov, B. "Practical Byzantine Fault Tolerance". OSDI, 1999.
Fischer, M., Lynch, N., and Paterson, M. "Impossibility of Distributed Consensus with One Faulty Process". JACM, 1985.
Lamport, L., Shostak, R., and Pease, M. "The Byzantine Generals Problem". ACM TOPLAS, 1982.