How do five servers scattered across data centres all agree on the same sequence of events — even when some crash, networks partition, and messages arrive late or out of order? That is the consensus problem, and Raft is its most understandable solution.
Imagine you run a distributed key-value store across five servers in three data centres. A client writes x = 42. Which server gets the write? What if that server crashes immediately after? What if the network between two data centres goes down — do you have one cluster or two? What if both halves keep accepting writes — how do you merge them later?
These are not edge cases. They are the normal operating conditions of any distributed system. The consensus problem asks: can we make a set of servers agree on a single sequence of values — a log — even when some servers crash or messages are lost?
```javascript
// Naive replication without consensus: write to all, hope for the best
// Shows exactly what goes wrong when messages are delayed or nodes crash
class NaiveNode {
  constructor(id) {
    this.id = id;
    this.store = {}; // local state
    this.alive = true;
  }
  write(key, value) {
    if (!this.alive) return false; // node is down
    this.store[key] = value;
    return true;
  }
  read(key) {
    return this.alive ? this.store[key] : null;
  }
}

function naiveClusterWrite(nodes, key, value) {
  // "Write to all" — fails silently if some nodes are down
  let successCount = 0;
  for (const node of nodes) if (node.write(key, value)) successCount++;
  return successCount;
}

const nodes = [1, 2, 3, 4, 5].map(id => new NaiveNode(id));

// === Problem 1: Node crash during write ===
nodes[2].alive = false; // node 3 crashes
const written = naiveClusterWrite(nodes, "x", 42);
console.log(`Write x=42: reached ${written}/5 nodes (node 3 was down)`);
nodes[2].alive = true; // node 3 recovers — but missed the write!
console.log(`Node 3 reads x: ${nodes[2].read("x")} ← undefined! Stale read.`);
console.log(`Node 1 reads x: ${nodes[0].read("x")} ← correct`);
console.log("Result: cluster has INCONSISTENT state\n");

// === Problem 2: Split-brain (network partition) ===
const groupA = nodes.slice(0, 2); // nodes 1-2 (think they're primary)
const groupB = nodes.slice(2);    // nodes 3-5 (also think they're primary)
// Both groups accept writes independently
naiveClusterWrite(groupA, "counter", 100); // group A writes 100
naiveClusterWrite(groupB, "counter", 200); // group B writes 200
console.log("After partition heals:");
nodes.forEach(n => console.log(`  Node ${n.id}: counter=${n.read("counter")}`));
console.log("Result: SPLIT-BRAIN — two different values, no way to reconcile\n");

// === Problem 3: Who's in charge? ===
// Without a designated coordinator, any node can accept writes.
// Two concurrent clients writing to "different primaries" → last-write-wins
// is non-deterministic. No linearizability guarantee.
```
```javascript
console.log("Consensus requirement:");
console.log("  1. Agreement: all live nodes eventually agree on the same value");
console.log("  2. Validity: the agreed value was proposed by some node");
console.log("  3. Termination: all live nodes eventually decide (no infinite waiting)");
```
Consensus in practice is almost always framed as the replicated state machine (RSM) model. The idea: if every node starts in the same state and applies the same sequence of commands in the same order, they will always arrive at the same state — no matter when or how each command is processed.
The challenge reduces to ensuring every node agrees on a single ordered log of commands. Raft does exactly this: it is a protocol for maintaining a replicated log. The state machine (a key-value store, a lock service, a configuration database) sits on top and is incidental to Raft itself.
```javascript
// Replicated State Machine: all nodes apply the SAME log in the SAME order
// If the log is consistent across nodes, state is automatically consistent
class StateMachine {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.store = {};
    this.lastApplied = 0;
  }
  apply(logEntry) {
    // Apply a committed log entry to the state machine
    const { index, command } = logEntry;
    if (index !== this.lastApplied + 1)
      throw new Error(`Gap in log: expected ${this.lastApplied + 1}, got ${index}`);
    const { op, key, value } = command;
    if (op === "set") this.store[key] = value;
    else if (op === "del") delete this.store[key];
    else if (op === "inc") this.store[key] = (this.store[key] || 0) + value;
    this.lastApplied = index;
  }
  applyLog(log) {
    for (const entry of log) this.apply(entry);
    return this.store;
  }
}

// The committed log — agreed upon by Raft before any node applies it
const committedLog = [
  { index: 1, term: 1, command: { op: "set", key: "balance", value: 1000 } },
  { index: 2, term: 1, command: { op: "inc", key: "balance", value: -200 } },
  { index: 3, term: 2, command: { op: "set", key: "user", value: "alice" } },
  { index: 4, term: 2, command: { op: "inc", key: "balance", value: 500 } },
  { index: 5, term: 3, command: { op: "del", key: "user", value: null } },
];

// Five nodes — each applies the log at its own pace (maybe at different times)
// but they all arrive at the SAME final state
const nodeStates = [1, 2, 3, 4, 5].map(id => {
  const sm = new StateMachine(id);
  return sm.applyLog(committedLog);
});

console.log("Committed log (5 entries, 3 terms):");
committedLog.forEach(e =>
  console.log(`  [${e.index}] term=${e.term} ${e.command.op}(${e.command.key}${e.command.value != null ? '=' + e.command.value : ''})`)
);
console.log("\nState machine output on all 5 nodes:");
nodeStates.forEach((s, i) => console.log(`  Node ${i + 1}: ${JSON.stringify(s)}`));
const allSame = nodeStates.every(s => JSON.stringify(s) === JSON.stringify(nodeStates[0]));
console.log(`\nAll nodes have identical state: ${allSame} ✓`);
console.log("Key insight: Raft's only job is to maintain this consistent log.");
```
Raft decomposes consensus into three relatively independent sub-problems and solves each with a clear, simple rule. The insight that makes Raft more understandable than its predecessor Paxos is the insistence on a single strong leader at all times.
In Raft, all log entries flow through the leader. Clients send commands only to the leader. The leader appends entries to its log, replicates them to followers, and only once a majority have acknowledged does it commit. Followers are completely passive — they accept what the leader sends.
Every Raft node is always in one of three states: follower, candidate, or leader. All nodes start as followers. Raft uses a notion of logical time called terms — monotonically increasing integers, each beginning with an election.
The key mechanism is the election timeout: each follower waits a random duration (typically 150–300ms) for a heartbeat from the current leader. If no heartbeat arrives in time, the follower converts to candidate, increments its term, and sends RequestVote RPCs to all other nodes. The first candidate to receive votes from a majority wins.
The new leader then re-establishes its authority with heartbeats. If all nodes used the same timeout, they would all start elections simultaneously — and since each votes for itself, no one would get a majority. Randomising the timeout means one node almost always times out first, starts the election, and wins before the others even wake up. Ties are unlikely, and when they do happen they self-resolve in the next term with fresh random timeouts.
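A quick simulation makes this concrete. This is an illustrative sketch, not anything from the paper's reference implementation: the 150–300ms range is the paper's suggestion, and the 5ms figure for a vote round trip is an assumption.

```javascript
// Illustrative sketch: how often do the two fastest timeouts land within a
// vote round trip of each other (close enough to risk a split vote)?
function electionTimeout() {
  return 150 + Math.random() * 150; // uniform in [150, 300) ms
}

const TRIALS = 10000;
let risky = 0;
for (let t = 0; t < TRIALS; t++) {
  const timeouts = [1, 2, 3, 4, 5].map(() => electionTimeout()).sort((a, b) => a - b);
  // If the fastest node's head start exceeds an assumed ~5ms round trip,
  // it collects a majority before anyone else even becomes a candidate
  if (timeouts[1] - timeouts[0] < 5) risky++;
}
const p = risky / TRIALS;
console.log(`One election round at risk of a split vote: ${(100 * p).toFixed(1)}%`);
console.log(`Two risky rounds in a row: ${(100 * p * p).toFixed(2)}%`);
console.log(`Three in a row: ${(100 * p ** 3).toFixed(3)}%`);
```

Because each term re-randomises the timeouts, the chance of repeated split votes decays geometrically — which is why elections settle within a term or two in practice.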
The animated cluster above shows the mechanics of one election. The timeline below serves a different purpose: it compresses many seconds or minutes of cluster life into one strip, so you can see that most of Raft is boring stable leadership, with brief election windows only when something goes wrong.
```javascript
// Raft leader election — the complete algorithm
class RaftNode {
  constructor(id, peers) {
    this.id = id;
    this.peers = peers;      // array of other RaftNode references
    this.state = "follower";
    this.currentTerm = 0;
    this.votedFor = null;    // who we voted for in current term
    this.log = [];           // [{term, command}]
    this.alive = true;
  }

  requestVote(candidateId, candidateTerm, lastLogIndex, lastLogTerm) {
    // RPC handler: should we vote for this candidate?
    if (!this.alive) return { voteGranted: false, term: -1 };

    // Rule 1: candidate's term must be >= ours
    if (candidateTerm < this.currentTerm)
      return { voteGranted: false, term: this.currentTerm };

    // Discovering a higher term: revert to follower
    if (candidateTerm > this.currentTerm) {
      this.currentTerm = candidateTerm;
      this.state = "follower";
      this.votedFor = null;
    }

    // Rule 2: we haven't already voted for someone else in this term
    const alreadyVoted = this.votedFor !== null && this.votedFor !== candidateId;
    if (alreadyVoted) return { voteGranted: false, term: this.currentTerm };

    // Rule 3: candidate's log must be at least as up-to-date as ours
    // "More up-to-date" = higher last log term, or same term + longer log
    const myLastLogTerm = this.log.length ? this.log.at(-1).term : 0;
    const myLastLogIndex = this.log.length;
    const logOk = lastLogTerm > myLastLogTerm ||
      (lastLogTerm === myLastLogTerm && lastLogIndex >= myLastLogIndex);
    if (!logOk) return { voteGranted: false, term: this.currentTerm };

    // Grant vote
    this.votedFor = candidateId;
    return { voteGranted: true, term: this.currentTerm };
  }

  startElection() {
    this.state = "candidate";
    this.currentTerm += 1;
    this.votedFor = this.id; // always vote for self
    let votesReceived = 1;   // count self-vote
    const majority = Math.floor((this.peers.length + 2) / 2); // e.g. 3 of 5
    const myLastLogTerm = this.log.length ? this.log.at(-1).term : 0;
    const myLastLogIndex = this.log.length;
    console.log(`Node ${this.id} starting election for term ${this.currentTerm}`);
    for (const peer of this.peers) {
      const resp = peer.requestVote(this.id, this.currentTerm, myLastLogIndex, myLastLogTerm);
      if (resp.voteGranted) {
        votesReceived++;
        console.log(`  Node ${peer.id} → VOTE GRANTED (total: ${votesReceived}/${this.peers.length + 1})`);
      } else {
        console.log(`  Node ${peer.id} → vote denied (term=${resp.term})`);
      }
    }
    if (votesReceived >= majority) {
      this.state = "leader";
      console.log(`Node ${this.id} WON election for term ${this.currentTerm} (${votesReceived}/${this.peers.length + 1} votes)`);
    } else {
      this.state = "follower";
      console.log(`Node ${this.id} lost election (${votesReceived} votes — not a majority)`);
    }
  }
}

// Build a 5-node cluster
const cluster = [1, 2, 3, 4, 5].map(id => new RaftNode(id, []));
cluster.forEach(n => { n.peers = cluster.filter(p => p.id !== n.id); });

console.log("=== Normal election (node 2 times out first) ===");
cluster[1].startElection(); // Node 2 wins

console.log("\n=== Leader crashes, node 4 times out next ===");
cluster[1].alive = false; // node 2 (leader) crashes
// Node 4's log is as up-to-date as everyone else's, so it can win
cluster[3].startElection();
```
Once a leader is elected, it handles all client requests. When a client sends a command, the leader: (1) appends the entry to its own log with the current term, (2) sends AppendEntries RPCs to all followers in parallel, (3) once a majority of nodes have written the entry, marks it as committed, (4) applies it to its state machine, (5) replies to the client, (6) notifies followers of the commit in subsequent RPCs.
```javascript
// Log replication: leader's AppendEntries RPC
// Includes the log matching check that keeps all followers consistent
class RaftLog {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.entries = [];      // [{term, command}] — 1-indexed in paper, 0-indexed here
    this.commitIndex = -1;  // highest log index known to be committed
    this.lastApplied = -1;  // highest log index applied to state machine
  }

  appendEntries({ leaderTerm, prevLogIndex, prevLogTerm, entries, leaderCommit }) {
    // The heart of log replication — called on each follower by the leader

    // Check 1: Log matching — do we agree on the entry before the new ones?
    if (prevLogIndex >= 0) {
      const ourEntry = this.entries[prevLogIndex];
      if (!ourEntry) {
        console.log(`  Node ${this.nodeId}: REJECT — missing entry at prevLogIndex=${prevLogIndex}`);
        return false;
      }
      if (ourEntry.term !== prevLogTerm) {
        // Conflicting entry: delete from here onwards (leader's log wins)
        console.log(`  Node ${this.nodeId}: CONFLICT at index ${prevLogIndex} — truncating log`);
        this.entries.splice(prevLogIndex);
        return false;
      }
    }

    // Check 2: Append new entries (skip any we already have)
    let insertAt = prevLogIndex + 1;
    for (const entry of entries) {
      if (this.entries[insertAt]?.term !== entry.term)
        this.entries[insertAt] = entry; // overwrite only if different
      insertAt++;
    }

    // Update commit index
    if (leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(leaderCommit, this.entries.length - 1);
      console.log(`  Node ${this.nodeId}: advancing commitIndex to ${this.commitIndex}`);
    }
    console.log(`  Node ${this.nodeId}: log length=${this.entries.length}, commit=${this.commitIndex}`);
    return true;
  }
}

// Simulate a leader replicating entries to 4 followers
const CLUSTER_SIZE = 5;
const MAJORITY = Math.floor(CLUSTER_SIZE / 2) + 1;
const followers = [2, 3, 4, 5].map(id => new RaftLog(id));
const leaderLog = { entries: [], commitIndex: -1, term: 1 };

function leaderAppend(command) {
  leaderLog.entries.push({ term: leaderLog.term, command });
  const newIndex = leaderLog.entries.length - 1;
  console.log(`Leader appends [${newIndex}] "${command}" in term ${leaderLog.term}`);
  // Send AppendEntries to each follower
  let acks = 1; // leader counts itself
  for (const follower of followers) {
    const ok = follower.appendEntries({
      leaderTerm: leaderLog.term,
      prevLogIndex: newIndex - 1,
      prevLogTerm: newIndex > 0 ? leaderLog.entries[newIndex - 1].term : -1,
      entries: [leaderLog.entries[newIndex]],
      leaderCommit: leaderLog.commitIndex,
    });
    if (ok) acks++;
  }
  // Commit requires a majority of the original 5-node cluster, even if one follower is down
  if (acks >= MAJORITY) {
    leaderLog.commitIndex = newIndex;
    console.log(`Committed at index ${newIndex} (${acks}/${CLUSTER_SIZE} acks ≥ majority ${MAJORITY})\n`);
  }
}

// Normal case: replicate a sequence of commands
console.log("=== Normal replication ===");
leaderAppend("set x=1");
leaderAppend("set y=2");

// One follower goes offline — still have majority (3 of 5)
console.log("=== One follower offline — still commits ===");
const deadFollower = followers[3]; // node 5 goes down
followers.splice(3, 1);
leaderAppend("set z=3"); // 1 leader + 3 followers = 4/5 acks ≥ 3
followers.push(deadFollower);
```
A different problem arises when a follower falls behind. Raft does not need to stop the cluster or copy the entire log from scratch. The leader keeps a nextIndex[follower] pointer saying “the next log entry I think this follower needs.” If an AppendEntries attempt is rejected because prevLogIndex/prevLogTerm does not match, the leader moves nextIndex backward and retries with an earlier prefix. Once it finds a matching point, it sends the missing suffix and the follower catches up.
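That backoff-and-retry loop can be sketched in a few lines. This is a simplified model with illustrative logs — a real leader runs this per follower against its nextIndex map rather than as a standalone loop:

```javascript
// Sketch of the leader's catch-up loop for a lagging, diverged follower.
// The leader's log (terms [1,1,3,3,3]) is authoritative; the follower
// ([1,1,2]) holds an uncommitted entry from a dead term-2 leader.
const leaderEntries = [
  { term: 1, cmd: "a" }, { term: 1, cmd: "b" },
  { term: 3, cmd: "c" }, { term: 3, cmd: "d" }, { term: 3, cmd: "e" },
];
const followerEntries = [
  { term: 1, cmd: "a" }, { term: 1, cmd: "b" }, { term: 2, cmd: "x" },
];

// Follower-side consistency check: accept only if the entry just before
// the new ones exists and carries the expected term
function followerAccepts(prevIndex, prevTerm) {
  if (prevIndex < 0) return true; // replaying from the very start
  const e = followerEntries[prevIndex];
  return e !== undefined && e.term === prevTerm;
}

let nextIndex = leaderEntries.length; // optimistic initial guess
while (nextIndex > 0) {
  const prevIndex = nextIndex - 1;
  if (followerAccepts(prevIndex, leaderEntries[prevIndex].term)) break;
  nextIndex--; // back off one entry and retry
  console.log(`Follower rejected — backing off to nextIndex=${nextIndex}`);
}

// Matching point found: overwrite from nextIndex with the leader's suffix
followerEntries.length = nextIndex;
followerEntries.push(...leaderEntries.slice(nextIndex));
console.log(`Follower caught up: terms = [${followerEntries.map(e => e.term)}]`);
// → Follower caught up: terms = [1,1,3,3,3]
```

Note that the follower's divergent term-2 entry is simply discarded — it was never committed, so overwriting it is safe.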
Raft's most critical property is the Leader Completeness property: if an entry is committed in some term, it will be present in the logs of all future leaders. This ensures the state machine never sees a committed entry go missing.
The mechanism is the vote restriction: a candidate can only receive a vote from a node if the candidate's log is at least as up-to-date as the voter's log. Since any committed entry was acknowledged by a majority, and any winning candidate must receive votes from a majority, those two majorities must overlap — so the winner must have seen the committed entry.
```javascript
// Safety: why a stale node can never become leader
// The vote restriction ensures committed entries are never lost
function isMoreUpToDate(candidateLog, voterLog) {
  // Raft's "at least as up-to-date" check (§5.4.1)
  // Compare last log term first; if equal, compare log length
  const candLastTerm = candidateLog.at(-1)?.term ?? 0;
  const voterLastTerm = voterLog.at(-1)?.term ?? 0;
  if (candLastTerm !== voterLastTerm) return candLastTerm > voterLastTerm;
  return candidateLog.length >= voterLog.length;
}

// Scenario: 5 nodes, leader crashes after committing entry [4]
// Three possible survivors try to become leader
const committedLog = [
  { index: 1, term: 1, cmd: "x=1" },
  { index: 2, term: 1, cmd: "y=2" },
  { index: 3, term: 2, cmd: "z=3" },
  { index: 4, term: 2, cmd: "w=4" }, // ← committed (majority acked)
];

// Node logs after leader crash (some had partial replication)
const nodeLogs = {
  2: [...committedLog],         // has all 4 entries (up-to-date)
  3: committedLog.slice(0, 3),  // missing entry 4 (stale!)
  4: committedLog.slice(0, 2),  // missing entries 3 & 4 (very stale!)
  5: [...committedLog],         // has all 4 entries (up-to-date)
};

console.log("After leader crash — node log lengths:");
Object.entries(nodeLogs).forEach(([id, log]) =>
  console.log(`  Node ${id}: ${log.length} entries, last term=${log.at(-1)?.term ?? 0}`)
);

// Simulate RequestVote — can node 3 (stale) become leader?
console.log("\nCan stale node 3 get votes from nodes 4 and 5?");
[4, 5].forEach(voterId => {
  const grantVote = isMoreUpToDate(nodeLogs[3], nodeLogs[voterId]);
  console.log(`  Node ${voterId} votes for node 3: ${grantVote ? "YES" : "NO — stale log rejected"}`);
});

// Can node 2 (up-to-date) win?
console.log("\nCan up-to-date node 2 get votes from nodes 3, 4, and 5?");
[3, 4, 5].forEach(voterId => {
  const grantVote = isMoreUpToDate(nodeLogs[2], nodeLogs[voterId]);
  console.log(`  Node ${voterId} votes for node 2: ${grantVote ? "YES" : "NO"}`);
});

console.log("\nKey insight:");
console.log("  Committed entry [4] was acked by nodes 1, 2, 5 (majority of 5)");
console.log("  Any winner must get votes from 3 of 5 nodes");
console.log("  Intersection of {1,2,5} and any majority of 5 is non-empty");
console.log("  → Winner always sees entry [4] → committed data is never lost ✓");
```
Raft is designed to remain correct under a wide range of failures: leader crashes, follower crashes, network partitions, and lost or delayed messages. The unifying rule is that the cluster makes progress as long as a majority of nodes can communicate.
Raft tolerates up to ⌊(n−1)/2⌋ simultaneous failures in an n-node cluster: 1 of 3, 2 of 5, 3 of 7. This is the best any crash-tolerant consensus protocol can do — tolerating f failures fundamentally requires at least 2f + 1 nodes, so this is a theoretical bound, not a Raft limitation. Paxos, Zab, and Multi-Paxos have the same bound.

Putting it all together: a structurally complete Raft implementation covering leader election, log replication, commit tracking, and basic crash/recovery scenarios. This is ~200 lines of JavaScript — the same architecture as production implementations in etcd and CockroachDB, just without the network layer.
```javascript
// Complete Raft node implementation
// Covers: leader election, log replication, commit index, crash recovery
class RaftNode {
  constructor(id) {
    this.id = id;
    this.peers = [];
    this.state = "follower";
    this.currentTerm = 0;
    this.votedFor = null;
    this.log = [];           // [{term, command}]
    this.commitIndex = -1;
    this.lastApplied = -1;
    this.nextIndex = {};     // leader: next log index to send to each peer
    this.matchIndex = {};    // leader: highest log index known replicated on each peer
    this.alive = true;
    this.stateMachine = {};  // applied commands go here
  }

  _upToDate(lastLogIndex, lastLogTerm) {
    const myTerm = this.log.at(-1)?.term ?? 0;
    const myLastIndex = this.log.length - 1;
    return lastLogTerm > myTerm || (lastLogTerm === myTerm && lastLogIndex >= myLastIndex);
  }

  requestVote(candidateId, term, lastLogIndex, lastLogTerm) {
    if (!this.alive) return { granted: false, term: 0 };
    if (term > this.currentTerm) {
      this.currentTerm = term;
      this.state = "follower";
      this.votedFor = null;
    }
    const canVote = term >= this.currentTerm &&
      (this.votedFor === null || this.votedFor === candidateId) &&
      this._upToDate(lastLogIndex, lastLogTerm);
    if (canVote) {
      this.votedFor = candidateId;
      return { granted: true, term };
    }
    return { granted: false, term: this.currentTerm };
  }

  appendEntries(leaderTerm, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit) {
    if (!this.alive || leaderTerm < this.currentTerm) return false;
    this.currentTerm = leaderTerm;
    this.state = "follower";
    // Log matching check
    if (prevLogIndex >= 0) {
      if (this.log.length <= prevLogIndex || this.log[prevLogIndex]?.term !== prevLogTerm)
        return false;
    }
    // Append new entries, truncating any conflicting suffix
    // (only truncate when entries are sent — an empty heartbeat must not shrink the log)
    if (entries.length > 0) {
      entries.forEach((e, i) => { this.log[prevLogIndex + 1 + i] = e; });
      this.log.length = prevLogIndex + 1 + entries.length;
    }
    if (leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(leaderCommit, this.log.length - 1);
      this._applyCommitted();
    }
    return true;
  }

  _applyCommitted() {
    while (this.lastApplied < this.commitIndex) {
      this.lastApplied++;
      const { key, value, op } = this.log[this.lastApplied].command;
      if (op === "set") this.stateMachine[key] = value;
      else if (op === "del") delete this.stateMachine[key];
    }
  }

  becomeLeader() {
    this.state = "leader";
    this.peers.forEach(p => {
      this.nextIndex[p.id] = this.log.length;
      this.matchIndex[p.id] = -1;
    });
    console.log(`Node ${this.id} became LEADER for term ${this.currentTerm}`);
  }

  startElection() {
    if (!this.alive) return;
    this.currentTerm++;
    this.state = "candidate";
    this.votedFor = this.id;
    let votes = 1;
    const lastIdx = this.log.length - 1;
    const lastTerm = this.log.at(-1)?.term ?? 0;
    for (const p of this.peers) {
      if (p.requestVote(this.id, this.currentTerm, lastIdx, lastTerm).granted) votes++;
    }
    if (votes > (this.peers.length + 1) / 2) this.becomeLeader();
    else this.state = "follower";
  }

  clientWrite(command) {
    // Client sends write to leader only
    if (!this.alive) throw new Error("Node is down");
    if (this.state !== "leader") throw new Error("Not the leader");
    const entry = { term: this.currentTerm, command };
    this.log.push(entry);
    const entryIndex = this.log.length - 1;
    let acks = 1; // leader counts itself
    for (const peer of this.peers) {
      const prevIdx = entryIndex - 1;
      const prevTerm = prevIdx >= 0 ? this.log[prevIdx].term : -1;
      if (peer.appendEntries(this.currentTerm, this.id, prevIdx, prevTerm, [entry], this.commitIndex))
        acks++;
    }
    if (acks > (this.peers.length + 1) / 2) {
      this.commitIndex = entryIndex;
      this._applyCommitted();
      // Notify followers of new commitIndex on next heartbeat
      for (const peer of this.peers)
        peer.appendEntries(this.currentTerm, this.id, entryIndex, entry.term, [], this.commitIndex);
      return true;
    }
    return false; // failed to commit (lost majority)
  }
}

// === Demo: build cluster, elect leader, write data, crash, elect new leader ===
const nodes = [1, 2, 3, 4, 5].map(id => new RaftNode(id));
nodes.forEach(n => { n.peers = nodes.filter(p => p.id !== n.id); });

console.log("=== Phase 1: Election ===");
nodes[0].startElection(); // node 1 times out first
const leader = nodes.find(n => n.state === "leader");

console.log("\n=== Phase 2: Normal writes ===");
leader.clientWrite({ op: "set", key: "x", value: 10 });
leader.clientWrite({ op: "set", key: "y", value: 20 });
leader.clientWrite({ op: "set", key: "z", value: 30 });
console.log(`All nodes' state machines after writes:`);
nodes.forEach(n => console.log(`  Node ${n.id}: ${JSON.stringify(n.stateMachine)}`));

console.log("\n=== Phase 3: Leader crashes ===");
leader.alive = false;
leader.state = "follower";
console.log(`Node ${leader.id} crashed`);
// The first surviving node times out and starts a new election
const survivor = nodes.find(n => n.alive && n.id !== leader.id);
survivor.startElection();
const newLeader = nodes.find(n => n.alive && n.state === "leader");
if (newLeader) {
  newLeader.clientWrite({ op: "set", key: "x", value: 99 });
  console.log(`After new write by node ${newLeader.id}:`);
  nodes.filter(n => n.alive).forEach(n =>
    console.log(`  Node ${n.id}: ${JSON.stringify(n.stateMachine)}`)
  );
}
```
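As a sanity check on the ⌊(n−1)/2⌋ fault-tolerance bound discussed above, the quorum arithmetic is two lines:

```javascript
// Majority quorum arithmetic: an n-node cluster commits with ⌊n/2⌋ + 1 acks,
// so it keeps working as long as no more than n − quorum = ⌊(n−1)/2⌋ nodes fail
for (const n of [3, 5, 7]) {
  const quorum = Math.floor(n / 2) + 1;
  const tolerated = n - quorum; // = ⌊(n−1)/2⌋
  console.log(`n=${n}: commit quorum=${quorum}, tolerates ${tolerated} failures`);
}
// → n=3: commit quorum=2, tolerates 1 failures
// → n=5: commit quorum=3, tolerates 2 failures
// → n=7: commit quorum=4, tolerates 3 failures
```

This is also why clusters use odd sizes: going from 5 to 6 nodes raises the quorum to 4 without tolerating any extra failures.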
Raft was explicitly designed as an alternative to Paxos that is easier to understand and implement. The original paper (Ongaro & Ousterhout, 2014) included a user study showing that students who learned Raft performed significantly better on quizzes than those who learned Paxos. Understandability was a first-class design goal.
Raft is now arguably the dominant consensus algorithm in new distributed systems work. Paxos dominates in legacy Google infrastructure (Chubby, Spanner's underlying protocol); Raft dominates in newer open-source infrastructure.
The most important insight from real deployments: the bottleneck is almost never the consensus protocol itself. Leader election takes milliseconds. Log replication overhead is a few percent. The hard problems in production Raft deployments are operational: detecting slow followers, managing log compaction (snapshotting), handling asymmetric network partitions, and tuning election timeouts for your latency environment.
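Of those operational concerns, log compaction is the most mechanical to illustrate. The Raft paper (§7) describes the real mechanism, including an InstallSnapshot RPC for followers that fall behind the compacted prefix; the function and field names below are illustrative, not from any particular implementation. The idea: once entries are applied, the state machine itself summarises them, so the applied prefix of the log can be replaced by a snapshot.

```javascript
// Illustrative snapshot-based log compaction: replace the applied prefix of
// the log with a snapshot of the state machine at that point
function takeSnapshot(node) {
  node.snapshot = {
    lastIncludedIndex: node.lastApplied,               // where the snapshot ends
    lastIncludedTerm: node.log[node.lastApplied].term, // needed for log matching
    state: JSON.parse(JSON.stringify(node.stateMachine)), // deep copy of state
  };
  node.log = node.log.slice(node.lastApplied + 1); // discard compacted prefix
}

const node = {
  log: [
    { term: 1, command: { op: "set", key: "x", value: 1 } },
    { term: 1, command: { op: "set", key: "x", value: 2 } },
    { term: 2, command: { op: "set", key: "y", value: 3 } },
    { term: 2, command: { op: "set", key: "z", value: 4 } }, // not yet applied
  ],
  lastApplied: 2, // entries 0..2 applied (0-indexed)
  stateMachine: { x: 2, y: 3 },
  snapshot: null,
};

takeSnapshot(node);
console.log(`Snapshot through index ${node.snapshot.lastIncludedIndex}, term ${node.snapshot.lastIncludedTerm}`);
console.log(`State captured: ${JSON.stringify(node.snapshot.state)}`);
console.log(`Log shrank from 4 entries to ${node.log.length}`);
```

Real implementations additionally keep lastIncludedIndex as a base offset so that log indices stay stable after truncation, and write the snapshot to disk before discarding the prefix.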