All posts

The Lineage of Consensus Algorithms — From 2PC to Byzantine Fault Tolerance, a Visual Journey Through Distributed Consensus

Visualize 2PC, Paxos, Raft, and PBFT through interactive simulators while tracing the evolution and theoretical foundations of distributed consensus algorithms.

Distributed SystemsConsensusPaxosRaftPBFT

When designing distributed systems, the problem of "getting multiple nodes to agree on a single value" is inescapable. Database replication, cross-service transactions, blockchain ledger management—at the core of all of these lies the consensus problem.

In this article, we trace the history of consensus algorithms, using interactive simulators to experience the evolution from 2PC → Paxos → Raft → PBFT firsthand. The goal is to understand the why behind each algorithm.

The Lineage of Consensus Algorithms

Let's start with a bird's-eye view of when the major consensus algorithms appeared.

2PC1978GrayBGP1982Lamport+Paxos1989LamportPBFT1999Castro+LiskovRaft2014Ongaro+OusterhoutCrash fault toleranceByzantine fault tolerance

Starting with 2PC in 1978, through the formulation of the Byzantine Generals Problem in 1982, and on to Paxos, PBFT, and Raft. Notice the two distinct tracks: crash fault tolerance on the left, and Byzantine fault tolerance on the right.

Fault Models — Crash Faults vs Byzantine Faults

Before diving into consensus algorithms, let's clarify the fault models.

  • Crash Fault: A node suddenly stops and never responds again. After recovery, it behaves correctly. It never lies.
  • Byzantine Fault: A node can behave arbitrarily. It can send false messages, tamper with data, or simply ignore requests—anything goes.

Crash faults are a special case of Byzantine faults. An algorithm that tolerates Byzantine faults also tolerates crash faults, but requires more nodes and generates more messages.

Fault ModelRequired NodesNode Behavior
Crash Fault2f + 1 (tolerates f failures)Only stops; never lies
Byzantine Fault3f + 1 (tolerates f failures)Can behave arbitrarily

The FLP Impossibility Theorem — Perfect Consensus Cannot Exist

In 1985, Fischer, Lynch, and Paterson proved a groundbreaking result.

FLP Impossibility (1985)SafetyCorrect resultLivenessEventually terminatesFault ToleranceSurvives crashesCannot have all 3in async model with even 1 crash

FLP Impossibility Theorem: In an asynchronous model where even one node can crash, no deterministic algorithm can simultaneously guarantee Safety, Liveness, and Fault Tolerance.

This doesn't mean "consensus algorithms are impossible." Practical algorithms circumvent FLP through:

  1. Randomization: Introducing randomness in proposer election (probabilistic liveness guarantee)
  2. Partial Synchrony Model: Assuming messages eventually arrive (GST: Global Stabilization Time)
  3. Failure Detectors: Using imperfect failure detection via timeouts

Paxos and Raft always guarantee safety, and guarantee liveness "eventually"—that is, after sufficient time has passed.

2PC — The Simplest Agreement

Two-Phase Commit (2PC) is the foundational protocol for distributed transactions. A Coordinator obtains agreement from all participants through two phases: Prepare → Commit.

CoordinatorParticipant AParticipant BPhase 1PrepareVote YesPhase 2CommitAck

Problems with 2PC

2PC is simple but has critical weaknesses:

  1. Blocking: If the Coordinator crashes, participants that already voted Yes wait indefinitely for the Commit/Abort decision. Timing out with an unilateral Abort risks inconsistency with other participants.
  2. Zero Fault Tolerance: A single Coordinator crash halts the entire system. It has fault tolerance of f=0.
  3. Single Point of Failure: All authority is centralized in the Coordinator.

To address these problems, Paxos was born.

Paxos — The Theoretical Foundation of Consensus

Paxos, proposed by Leslie Lamport in 1989, became the theoretical foundation for crash-fault-tolerant algorithms. It's used in Google's Chubby lock service and Spanner.

Paxos's Three Roles

  • Proposer: Proposes a value. Each proposal carries a unique proposal number.
  • Acceptor: Decides whether to accept a proposal. Agreement is reached when a majority accepts the same value.
  • Learner: Receives the agreed-upon value and applies it to the state machine. This article focuses on the interaction between Proposers and Acceptors.

Single-Decree Paxos Protocol

Paxos operates in two phases (similar to 2PC, but fundamentally different in that it relies on a majority).

Phase 1 — Prepare/Promise:

  1. Proposer selects a proposal number n and sends Prepare(n) to all Acceptors
  2. If n is the highest seen, an Acceptor replies with Promise(n, previously accepted value)

Phase 2 — Accept/Accepted:

  1. After receiving a majority of Promises, Proposer sends Accept(n, v)
    • If any Promise includes a previously accepted value, it must choose the value with the highest proposal number
    • Otherwise, it proposes its own value
  2. If the Acceptor hasn't made a Promise for a higher number, it replies Accepted(n, v)

Try the simulator to walk through each phase.

Paxos Simulator

Step through Single-Decree Paxos with different scenarios.

N1ProposerN2N3N4N5
PreparePromiseAcceptAccepted
Initial state: N1 is the Proposer, N2–N5 are Acceptors. N1 starts consensus with proposal number n=1.
[1] N1 begins consensus as Proposer
1 / 5

Why Paxos Is Safe

The core of Paxos's safety lies in ordering by proposal number.

  • Acceptors reject Accept for any proposal number smaller than the highest Promise they've made
  • Proposers respect previously accepted values included in Promises

These two constraints guarantee that once a majority has accepted a value, no future proposal can override it. This is the Paxos safety theorem.

Challenges with Paxos

Paxos is correct, but difficult to implement.

  1. Hard to Understand: Lamport's original paper "The Part-Time Parliament" used an unusual allegory about an ancient Greek parliament, confusing many researchers. He later rewrote it as "Paxos Made Simple".
  2. Non-Obvious Extension to Multi-Paxos: Single-Decree Paxos only decides one value. Extending it for practical log replication (Multi-Paxos) is not prescribed in the paper, leading to divergent implementations.
  3. Liveness Issues (Livelock): Two Proposers can compete with ever-increasing proposal numbers indefinitely, preventing agreement (see the "Proposer Conflict" scenario in the simulator).

Raft — Redesigned for Understandability

Raft, published by Diego Ongaro and John Ousterhout in 2014, was designed with the explicit goal of being "as safe as Paxos, but far easier to understand."

Raft's Design Philosophy

Raft solved Paxos's problems through two approaches:

  1. Problem Decomposition: Clearly separating consensus into Leader Election, Log Replication, and Safety.
  2. Strong Leader: All writes go through the Leader, with one-directional replication from Leader → Followers.

Raft's State Transitions

Each node has three states: Follower, Candidate, and Leader.

FollowerCandidateLeadertimeoutwins electionloses / higher termdiscovers higher termsplit vote

Leader Election

  1. All nodes start as Followers
  2. A node whose election timeout expires becomes a Candidate and increments its term
  3. It votes for itself and sends RequestVote RPCs to other nodes
  4. Upon receiving a majority of votes, it becomes Leader
  5. The Leader periodically sends heartbeats (empty AppendEntries) to reset election timeouts

Election Restriction: A voting node only grants its vote if the candidate's log is at least as up-to-date as its own. This prevents nodes without committed entries from becoming Leader.

Log Replication

  1. The Leader appends client writes to its log
  2. Sends log entries to all Followers via AppendEntries RPC
  3. Once a majority responds → commit (apply to state machine)

Leader Completeness Property: Once committed, an entry is guaranteed to appear in the log of every future Leader. This is the core of Raft's safety.

Raft Simulator

Visualize Raft's leader election, log replication, and fault tolerance.

N1term=0N2term=0N3term=0N4term=0N5term=0
Replicated Log
N1
empty
N2
empty
N3
empty
N4
empty
N5
empty
RequestVoteVoteAppendEntries
Initial: All nodes are Followers at term=0. No Leader yet. Each node has an election timeout; when it expires, the node becomes a Candidate.
[1] All nodes start as Followers
1 / 5

Raft vs Paxos

AspectPaxosRaft
Leader roleOptimizationRequired
Log flowAny Proposer → AcceptorLeader → Follower (unidirectional)
UnderstandabilityNotoriously difficultA primary design goal
Safety proofEquivalentEquivalent
AdoptionChubby, Spanneretcd, CockroachDB, TiKV

PBFT — Byzantine Fault Tolerance

Paxos and Raft only handle crash faults. In environments where nodes can "lie," a more powerful protocol is needed.

In 1999, Miguel Castro and Barbara Liskov proposed PBFT (Practical Byzantine Fault Tolerance), a groundbreaking practical algorithm that tolerates Byzantine faults.

The PBFT Protocol

PBFT operates in three phases.

  1. Pre-prepare: The Primary (equivalent to a leader) assigns a sequence number to the request and sends it to all Replicas
  2. Prepare: Each node sends a Prepare message to all other nodes (all-to-all communication)
  3. Commit: Once a node collects 2f+12f+1 Prepare messages, it sends a Commit message to all other nodes

The Prepare and Commit phases each require collecting 2f+12f+1 messages, ensuring that honest nodes maintain a majority even with ff Byzantine nodes.

Why 3f+1 Is Required

With nn total nodes and ff Byzantine nodes, there are nfn - f honest nodes. To reach the 2f+12f+1 threshold using only honest nodes:

nf2f+1    n3f+1n - f \geq 2f + 1 \implies n \geq 3f + 1

With 4 nodes, the system tolerates f=1f=1; with 7 nodes, f=2f=2 Byzantine failures.

PBFT Simulator

Visualize Practical Byzantine Fault Tolerance with honest and Byzantine nodes.

N0PrimaryN1ReplicaN2ReplicaN3Replica
Pre-preparePrepareCommitByzantine msg
PBFT Normal case: 4 nodes (n=3f+1, f=1). N0 is Primary, N1–N3 are Replicas. Client request arrives.
[1] 4-node PBFT (f=1): tolerates up to 1 Byzantine fault
1 / 4

Challenges with PBFT

  • O(n2)O(n^2) Message Complexity: All-to-all communication means message volume grows quadratically with node count
  • Scalability: Practical limit is on the order of tens of nodes
  • View Change Complexity: The protocol for replacing a faulty Primary is complex

Algorithm Comparison

AlgorithmFault ModelToleranceMsg ComplexityLatencyUse Case
2PCCrash0O(n)2 roundsDatabases
PaxosCrashf < n/2O(n)2 roundsChubby, Spanner
RaftCrashf < n/2O(n)2 roundsetcd, CockroachDB
PBFTByzantinef < n/3O(n²)3 roundsBlockchain

Real-World Adoption

Examining where each algorithm is used reveals the relationship between theory and practice.

Paxos Family

  • Google Chubby: Distributed lock service using Multi-Paxos.
  • Google Spanner: Globally distributed database using Paxos for replication.
  • Apache ZooKeeper: Uses Zab protocol (a Paxos variant).

Raft Family

  • etcd: Kubernetes backend store. Uses Raft for consistency.
  • CockroachDB: Distributed SQL database. Each Range is a Raft group.
  • TiKV: Distributed KV store (storage layer of TiDB). Uses Raft.
  • Consul: Service mesh control plane. Uses Raft for consensus.

PBFT Family

  • Hyperledger Fabric: Enterprise blockchain. Early versions used PBFT; since v1.4.1 uses Raft, and since v3.0 uses SmartBFT (based on BFT-SMART).
  • Tendermint: Blockchain consensus engine. A PBFT variant.

Summary

Looking back at the evolution of consensus algorithms, several important patterns emerge.

  1. Expanding Fault Models: From crash faults → Byzantine faults. The stronger the faults to tolerate, the more nodes and messages are required.
  2. The Importance of Understandability: Paxos's difficulty gave rise to Raft. Correctness alone isn't enough—ease of understanding and implementation is a critical design metric.
  3. The Gap Between Theory and Practice: The FLP impossibility theorem says "perfection is impossible," but real systems work well enough using timeouts and randomization.
  4. Choosing Tradeoffs: Always guarantee safety; guarantee liveness "eventually"—this is the common strategy of modern consensus algorithms.

Consensus in distributed systems appears simple on the surface, but it's a deep field with over 40 years of accumulated research. If the simulators in this article helped convey the why behind each algorithm, even a little, that's a success.

References