pk.org: CS 417/Lecture Notes

Consensus and Replicated State Machines

Study Guide

Paul Krzyzanowski – 2026-02-18

The Core Problem

Consensus is the problem of getting a group of nodes to agree on a single value, even when some nodes crash or messages are delayed. It comes up in leader election, transaction commit, log ordering, and configuration management. Every fault-tolerant distributed system is built on top of some form of consensus.

Without proper consensus, a network partition can cause both sides to elect their own leader and accept writes independently. This condition is called split-brain: the system ends up with two divergent versions of state that must be reconciled or rolled back when the partition heals.

Replicated State Machines

A state machine is deterministic: the same inputs in the same order always produce the same state. State machine replication runs identical copies of the state machine on multiple servers, ensuring they all apply the same commands in the same order. A replicated log is the data structure that imposes that order – consensus ensures all servers agree on the contents of the log.

Key property: if every server starts from the same initial state and executes the same log entries in order, they will all reach the same state. This is why log ordering is the central challenge.
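The determinism argument can be demonstrated in a few lines. A minimal sketch, assuming a toy key-value state machine with (op, key, value) commands (an illustrative format, not from the notes):

```python
# Two replicas of a deterministic key-value state machine applying the
# same log in the same order converge to the same state.

def apply(state, command):
    """Apply one command to the state, deterministically."""
    op, key, value = command
    if op == "set":
        state[key] = value
    elif op == "add":
        state[key] = state.get(key, 0) + value
    return state

log = [("set", "x", 1), ("add", "x", 4), ("set", "y", 7)]

replica_a, replica_b = {}, {}
for entry in log:
    apply(replica_a, entry)
    apply(replica_b, entry)

assert replica_a == replica_b == {"x": 5, "y": 7}
```

If the replicas applied the entries in different orders (say, the "set" after the "add"), their states could diverge, which is exactly why consensus on log order is the central problem.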

Consensus Properties

Three properties are required for consensus:

Agreement – all non-faulty processes decide on the same value.

Validity – the value decided was actually proposed by some process.

Termination – every non-faulty process eventually decides.

Agreement and validity are safety properties (nothing bad happens). Termination is a liveness property (something good eventually happens). The tension between them is central to the difficulty of consensus.

FLP Impossibility

The FLP Impossibility Result proves that in a purely asynchronous distributed system, no deterministic algorithm can guarantee consensus if even one process may crash.

The core obstacle: in an asynchronous system, there is no way to distinguish a crashed process from a very slow one. This makes it impossible to deterministically resolve certain undecided (bivalent) states without risking a safety violation.

FLP does not mean consensus is impossible in practice. All real protocols handle it by guaranteeing safety unconditionally but sacrificing liveness under extreme instability (for example, if no leader can be elected because the network is too chaotic). Eventually, when the system stabilizes, progress resumes.

Paxos

Paxos was created by Leslie Lamport around 1989 and finally published in 1998. It is the foundational consensus algorithm in the field.

Paxos has three roles: proposers (initiate proposals), acceptors (vote), and learners (learn the decided value). In practice, servers typically play all three roles.

The protocol relies on a simple but powerful property: any two majorities of acceptors in a group of n share at least one member. That overlap means a new majority cannot form without including at least one acceptor that participated in an earlier majority. Combined with Paxos’s promise and value-selection rules, this prevents two different values from being chosen for the same decision.
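The overlap property is a pigeonhole argument: two majorities of n contain more than n members in total, so they must share one. A small exhaustive check, for illustration:

```python
from itertools import combinations

# Exhaustively verify quorum intersection for small clusters: any two
# majorities of n acceptors share at least one member. Two majorities
# hold 2 * (n // 2 + 1) > n members combined, so they must overlap.
for n in range(1, 8):
    majority = n // 2 + 1
    for q1 in combinations(range(n), majority):
        for q2 in combinations(range(n), majority):
            assert set(q1) & set(q2), "found two disjoint majorities"
```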

The algorithm runs in two phases:

Phase 1 (Prepare/Promise): a proposer picks a unique proposal number n and sends Prepare(n) to the acceptors. An acceptor that has not promised a higher number replies with a promise to reject anything numbered below n, along with the highest-numbered proposal it has already accepted, if any.

Phase 2 (Accept/Accepted): if the proposer receives promises from a majority, it sends Accept(n, v), where v is the value from the highest-numbered proposal reported in the promises, or the proposer's own value if none was reported. An acceptor accepts unless it has since promised a higher number.

A value is decided once a majority of acceptors have accepted it.
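The acceptor's side of the two phases can be sketched as follows. The class and method names are illustrative, not part of any standard API:

```python
class Acceptor:
    """Single-decree Paxos acceptor: the promise and accept rules."""

    def __init__(self):
        self.promised = -1        # highest proposal number promised
        self.accepted_n = -1      # number of the accepted proposal, if any
        self.accepted_v = None    # the accepted value itself

    def prepare(self, n):
        """Phase 1: promise not to accept proposals numbered below n.
        Returns (ok, previously accepted proposal) so the proposer can
        adopt the highest-numbered value already accepted."""
        if n > self.promised:
            self.promised = n
            return True, (self.accepted_n, self.accepted_v)
        return False, None

    def accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher number was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return True
        return False
```

Note how a later proposer that runs Phase 1 against a majority necessarily contacts an acceptor from any earlier accepting majority, learns the accepted value, and is forced to re-propose it. This is the mechanism behind the overlap argument above.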

Multi-Paxos extends single-decree Paxos to decide a sequence of values (a log) by running a separate Phase 2 per log slot, with a stable leader that skips Phase 1 after the first slot. When the leader changes, the new leader must run Phase 1 again for any log slots it has not yet resolved, which is why leader instability is expensive in Paxos.

Paxos is notoriously difficult to implement correctly. It leaves important practical questions unspecified: conflict resolution between concurrent proposers, cluster membership changes, and recovery from partial failures. Real deployments (Google Chubby, Apache ZooKeeper’s Zab variant, Google Spanner) required substantial engineering beyond the base algorithm.

Raft

Raft was designed by Diego Ongaro and John Ousterhout and published in 2014. It provides the same safety guarantees as Multi-Paxos but is designed to be significantly easier to understand and implement. It is used in etcd, CockroachDB, TiKV, Consul, and YugabyteDB.

Terms

Raft divides time into terms, numbered with consecutive integers. A term begins with an election. If a candidate wins, it serves as leader for that term. If no one wins (split vote), a new term begins. Terms serve as a logical clock – servers reject messages from older terms and update their term when they see a higher one.
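The logical-clock rule can be sketched as a single check applied to every incoming RPC. The field names and return values are illustrative:

```python
# Every RPC carries the sender's term; the receiver reconciles terms
# before processing the message.

def observe_term(server, msg_term):
    """Reject stale messages; step down on seeing a newer term."""
    if msg_term < server["current_term"]:
        return "reject"                      # message from an older term
    if msg_term > server["current_term"]:
        server["current_term"] = msg_term
        server["state"] = "follower"         # any role reverts to follower
        server["voted_for"] = None           # free to vote in the new term
    return "process"
```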

Server States

Every server is in exactly one of three states. A follower is passive: it responds to requests from leaders and candidates but does not initiate any. All servers start as followers. A candidate is a follower that has timed out waiting for a heartbeat and has initiated an election. A leader handles all client requests, replicates log entries to followers, and sends periodic heartbeats to prevent new elections.

Leader Election

Each follower maintains a randomized election timeout (typically 150–300 ms). If it expires without hearing from a leader, the follower starts an election:

  1. Increment the current term.

  2. Transition to candidate state and vote for itself.

  3. Send RequestVote RPCs to all other servers.

  4. A server grants its vote if it has not already voted this term and the candidate’s log is at least as up-to-date as its own.

  5. If the candidate receives votes from a majority, it becomes leader and immediately sends heartbeats to suppress new elections.

  6. If no candidate wins (split vote), the term ends with no leader and a new election begins with a higher term.

“More up-to-date” is defined precisely: a log is more up-to-date if its last entry has a higher term, or if the terms are equal, the longer log wins. This restriction ensures that a candidate cannot win unless its log contains all committed entries. The randomized timeout makes it unlikely that multiple candidates start elections at the same time.
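The vote-granting check in steps 4 and the up-to-date comparison can be sketched together. The field names (current_term, voted_for, and so on) are illustrative:

```python
def log_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Candidate's log is at least as up-to-date as ours: higher last
    term wins; with equal terms, the longer (or equal) log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def grant_vote(server, term, candidate_id, cand_last_term, cand_last_index):
    """Grant at most one vote per term, and only to a candidate whose
    log is at least as up-to-date as our own."""
    if term < server["current_term"]:
        return False
    if term > server["current_term"]:        # newer term: update, reset vote
        server["current_term"] = term
        server["voted_for"] = None
    if server["voted_for"] not in (None, candidate_id):
        return False                         # already voted this term
    if not log_up_to_date(cand_last_term, cand_last_index,
                          server["last_log_term"], server["last_log_index"]):
        return False
    server["voted_for"] = candidate_id
    return True
```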

Log Replication

Once a leader is elected, it handles all client requests. For each command, the sequence is:

  1. The leader appends the command to its own log, tagged with the current term and index.

  2. It sends AppendEntries RPCs to all followers in parallel.

  3. Once a majority of servers have acknowledged the entry, the leader commits it.

  4. The leader applies the entry to its state machine and returns the result to the client.

  5. Subsequent AppendEntries messages inform followers of the commit index, at which point followers apply the committed entries to their own state machines.

Each AppendEntries RPC includes the index and term of the entry immediately preceding the new one. A follower rejects the RPC if its own log does not match at that position. When this happens, the leader backs up and retries from an earlier entry until it finds a point of agreement, then overwrites any conflicting entries from that point forward.

The Log Matching Property guarantees that if two entries in different logs share the same index and term, the logs are identical through that index. This invariant is what makes the consistency check in AppendEntries sufficient to detect and repair divergence.
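The follower-side check and repair can be sketched as follows. A log entry is modeled as a (term, command) pair; indices are 1-based as in the Raft paper, and the function names are illustrative:

```python
def append_entries(log, prev_index, prev_term, entries):
    """Reject unless our log matches the leader's at prev_index;
    otherwise delete any conflicting suffix and append the leader's
    entries. prev_index == 0 means 'before the first entry'."""
    if prev_index > len(log):
        return False                         # gap: we lack the prev entry
    if prev_index > 0 and log[prev_index - 1][0] != prev_term:
        return False                         # term mismatch at prev entry
    for i, entry in enumerate(entries):
        pos = prev_index + i                 # 0-based slot for this entry
        if pos < len(log) and log[pos][0] != entry[0]:
            del log[pos:]                    # conflict: drop this suffix
        if pos >= len(log):
            log.append(entry)
    return True
```

A rejection is what triggers the leader's back-up-and-retry behavior described above; once the call succeeds, the Log Matching Property holds through the appended entries.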

Commit Rules

An entry is committed once stored on a majority. However, the leader may not commit entries from previous terms directly – it must first commit an entry from its own current term. Old entries become committed implicitly when the current-term entry is committed. This prevents a subtle safety bug involving overwriting entries that were transiently replicated but not truly committed.
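The rule can be sketched as the leader's commit-index advancement check. Entries are (term, command) pairs, match_index holds each follower's highest replicated index, and the names are illustrative:

```python
def advance_commit(log, commit_index, current_term, match_index, n_servers):
    """Return the highest 1-based index that is safe to commit: it must
    be replicated on a majority AND belong to the leader's current term.
    Earlier entries then become committed implicitly."""
    for idx in range(len(log), commit_index, -1):
        replicated = 1 + sum(1 for m in match_index if m >= idx)  # + leader
        if replicated > n_servers // 2 and log[idx - 1][0] == current_term:
            return idx
    return commit_index
```

Note that an old-term entry on a majority is not committed by this rule alone; it commits only once a current-term entry above it does.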

Safety: The Leader Completeness Property

If an entry is committed in a given term, it will appear in the log of every leader in all subsequent terms. This follows from the election restriction: a candidate can win only if a majority of servers consider its log at least as up-to-date as their own, and since every committed entry is stored on a majority, at least one of those voters holds it – so the winner's log must contain all committed entries. Safety in Raft is unconditional: the protocol never allows two servers to commit different entries at the same index. A majority of live servers is required only for progress, not for safety.

Liveness

Liveness is conditional. Raft requires a stable elected leader to make progress. If elections repeatedly fail due to network instability, the system stalls. In practice, randomized timeouts prevent this from being a persistent problem.

Cluster Membership Changes

Adding or removing servers requires care. Raft uses joint consensus, a two-phase approach: the cluster transitions through a configuration that includes both old and new member sets, requiring majority agreement from both before switching to the new configuration alone. Even when adding or removing one server at a time, the reconfiguration must ensure majority intersection across the transition – joint consensus is how Raft guarantees that property.

Log Compaction

Logs grow indefinitely. Servers periodically take a snapshot of the state machine and discard all log entries before that point. If a follower falls too far behind, the leader sends it the snapshot directly via an InstallSnapshot RPC.
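Compaction can be sketched as capturing the state, recording the last included index and term, and discarding the covered log prefix. The in-memory representation is an illustrative assumption:

```python
def take_snapshot(state, log, last_applied, first_index):
    """Snapshot everything through last_applied (a 1-based absolute
    index); first_index is the absolute index of log[0]. Returns the
    snapshot and the truncated log."""
    covered = last_applied - first_index + 1     # entries covered
    snapshot = {
        "state": dict(state),
        "last_included_index": last_applied,
        "last_included_term": log[covered - 1][0],
    }
    return snapshot, log[covered:]
```

The last included index and term are kept so the AppendEntries consistency check still works for the first entry after the snapshot.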

What You Don’t Need to Study

Key Takeaways

Consensus is hard because, in an asynchronous system, a crashed process cannot be distinguished from a slow one. FLP tells us something fundamental about this: in a fully asynchronous model, no deterministic algorithm can be both safe and live if any process can crash. Real protocols choose safety and accept that they may stall temporarily.

Raft improves on Paxos primarily through clarity of design: a single strong leader, randomized election timeouts, and unified handling of log replication and leader completeness. The safety guarantees are equivalent; the difference is in how easy the protocol is to reason about and implement correctly.

