How do five servers scattered across data centres all agree on the same sequence of events — even when some crash, networks partition, and messages arrive late or out of order? That is the consensus problem, and Raft is its most understandable solution.
Imagine you run a distributed key-value store across five servers in three data centres. A client writes x = 42. Which server gets the write? What if that server crashes immediately after? What if the network between two data centres goes down — do you have one cluster or two? What if both halves keep accepting writes — how do you merge them later?
These are not edge cases. They are the normal operating conditions of any distributed system. The consensus problem asks: can we make a set of servers agree on a single sequence of values — a log — even when some servers crash or messages are lost?
```javascript
// Naive replication without consensus: write to all, hope for the best
// Shows exactly what goes wrong when messages are delayed or nodes crash
class NaiveNode {
  constructor(id) {
    this.id = id;
    this.store = {}; // local state
    this.alive = true;
  }
  write(key, value) {
    if (!this.alive) return false; // node is down
    this.store[key] = value;
    return true;
  }
  read(key) {
    return this.alive ? this.store[key] : null;
  }
}

function naiveClusterWrite(nodes, key, value) {
  // "Write to all" — fails silently if some nodes are down
  let successCount = 0;
  for (const node of nodes) if (node.write(key, value)) successCount++;
  return successCount;
}

const nodes = [1, 2, 3, 4, 5].map(id => new NaiveNode(id));

// === Problem 1: Node crash during write ===
nodes[2].alive = false; // node 3 crashes
const written = naiveClusterWrite(nodes, "x", 42);
console.log(`Write x=42: reached ${written}/5 nodes (node 3 was down)`);
nodes[2].alive = true; // node 3 recovers — but missed the write!
console.log(`Node 3 reads x: ${nodes[2].read("x")} ← undefined! Stale read.`);
console.log(`Node 1 reads x: ${nodes[0].read("x")} ← correct`);
console.log("Result: cluster has INCONSISTENT state\n");

// === Problem 2: Split-brain (network partition) ===
const groupA = nodes.slice(0, 2); // nodes 1-2 (think they're primary)
const groupB = nodes.slice(2);    // nodes 3-5 (also think they're primary)
// Both groups accept writes independently
naiveClusterWrite(groupA, "counter", 100); // group A writes 100
naiveClusterWrite(groupB, "counter", 200); // group B writes 200
console.log("After partition heals:");
nodes.forEach(n => console.log(`  Node ${n.id}: counter=${n.read("counter")}`));
console.log("Result: SPLIT-BRAIN — two different values, no way to reconcile\n");

// === Problem 3: Who's in charge? ===
// Without a designated coordinator, any node can accept writes.
// Two concurrent clients writing to "different primaries" → last-write-wins
// is non-deterministic. No linearizability guarantee.
```
```javascript
console.log("Consensus requirement:");
console.log("  1. Agreement: all live nodes eventually agree on the same value");
console.log("  2. Validity: the agreed value was proposed by some node");
console.log("  3. Termination: all live nodes eventually decide (no infinite waiting)");
```
Consensus in practice is almost always framed as the replicated state machine (RSM) model. The idea: if every node starts in the same state and applies the same sequence of commands in the same order, they will always arrive at the same state — no matter when or how each command is processed.
The challenge reduces to ensuring every node agrees on a single ordered log of commands. Raft does exactly this: it is a protocol for maintaining a replicated log. The state machine (a key-value store, a lock service, a configuration database) sits on top and is incidental to Raft itself.
```javascript
// Replicated State Machine: all nodes apply the SAME log in the SAME order
// If the log is consistent across nodes, state is automatically consistent
class StateMachine {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.store = {};
    this.lastApplied = 0;
  }
  apply(logEntry) {
    // Apply a committed log entry to the state machine
    const { index, command } = logEntry;
    if (index !== this.lastApplied + 1)
      throw new Error(`Gap in log: expected ${this.lastApplied + 1}, got ${index}`);
    const { op, key, value } = command;
    if (op === "set") this.store[key] = value;
    else if (op === "del") delete this.store[key];
    else if (op === "inc") this.store[key] = (this.store[key] || 0) + value;
    this.lastApplied = index;
  }
  applyLog(log) {
    for (const entry of log) this.apply(entry);
    return this.store;
  }
}

// The committed log — agreed upon by Raft before any node applies it
const committedLog = [
  { index: 1, term: 1, command: { op: "set", key: "balance", value: 1000 } },
  { index: 2, term: 1, command: { op: "inc", key: "balance", value: -200 } },
  { index: 3, term: 2, command: { op: "set", key: "user", value: "alice" } },
  { index: 4, term: 2, command: { op: "inc", key: "balance", value: 500 } },
  { index: 5, term: 3, command: { op: "del", key: "user", value: null } },
];

// Five nodes — each applies the log at its own pace (maybe at different times)
// but they all arrive at the SAME final state
const nodeStates = [1, 2, 3, 4, 5].map(id => {
  const sm = new StateMachine(id);
  return sm.applyLog(committedLog);
});

console.log("Committed log (5 entries, 3 terms):");
committedLog.forEach(e =>
  console.log(`  [${e.index}] term=${e.term} ${e.command.op}(${e.command.key}${e.command.value != null ? '=' + e.command.value : ''})`)
);
console.log("\nState machine output on all 5 nodes:");
nodeStates.forEach((s, i) => console.log(`  Node ${i + 1}: ${JSON.stringify(s)}`));
const allSame = nodeStates.every(s => JSON.stringify(s) === JSON.stringify(nodeStates[0]));
console.log(`\nAll nodes have identical state: ${allSame} ✓`);
console.log("Key insight: Raft's only job is to maintain this consistent log.");
```
Raft decomposes consensus into three relatively independent sub-problems and solves each with a clear, simple rule. The insight that makes Raft more understandable than its predecessor Paxos is the insistence on a single strong leader at all times.
In Raft, all log entries flow through the leader. Clients send commands only to the leader. The leader appends entries to its log, replicates them to followers, and only once a majority have acknowledged does it commit. Followers are completely passive — they accept what the leader sends.
Every Raft node is always in one of three states: follower, candidate, or leader. All nodes start as followers. Raft uses a notion of logical time called terms — monotonically increasing integers, each beginning with an election.
The key mechanism is the election timeout: each follower waits a random duration (typically 150–300ms) for a heartbeat from the current leader. If no heartbeat arrives in time, the follower converts to candidate, increments its term, and sends RequestVote RPCs to all other nodes. The first candidate to receive votes from a majority wins.
The new leader then re-establishes its authority with heartbeats. If all nodes used the same timeout, they would all start elections simultaneously — and since each votes for itself, no one would get a majority. Randomising the timeout means one node almost always times out first, starts the election, and wins before the others even wake up. Ties are unlikely, and when they do happen they self-resolve in the next term with fresh random timeouts.
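A quick simulation makes this concrete. This is an illustrative sketch, not anything from the paper's reference implementation: the 150–300ms range is the paper's suggestion, and the 5ms figure for a vote round trip is an assumption.

```javascript
// Illustrative sketch: how often do the two fastest timeouts land within a
// vote round trip of each other (close enough to risk a split vote)?
function electionTimeout() {
  return 150 + Math.random() * 150; // uniform in [150, 300) ms
}

const TRIALS = 10000;
let risky = 0;
for (let t = 0; t < TRIALS; t++) {
  const timeouts = [1, 2, 3, 4, 5].map(() => electionTimeout()).sort((a, b) => a - b);
  // If the fastest node's head start exceeds an assumed ~5ms round trip,
  // it collects a majority before anyone else even becomes a candidate
  if (timeouts[1] - timeouts[0] < 5) risky++;
}
const p = risky / TRIALS;
console.log(`One election round at risk of a split vote: ${(100 * p).toFixed(1)}%`);
console.log(`Two risky rounds in a row: ${(100 * p * p).toFixed(2)}%`);
console.log(`Three in a row: ${(100 * p ** 3).toFixed(3)}%`);
```

Because each term re-randomises the timeouts, the chance of repeated split votes decays geometrically — which is why elections settle within a term or two in practice.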
The animated cluster above shows the mechanics of one election. The timeline below serves a different purpose: it compresses many seconds or minutes of cluster life into one strip, so you can see that most of Raft is boring stable leadership, with brief election windows only when something goes wrong.
```javascript
// Raft leader election — the complete algorithm
class RaftNode {
  constructor(id, peers) {
    this.id = id;
    this.peers = peers;      // array of other RaftNode references
    this.state = "follower";
    this.currentTerm = 0;
    this.votedFor = null;    // who we voted for in current term
    this.log = [];           // [{term, command}]
    this.alive = true;
  }

  requestVote(candidateId, candidateTerm, lastLogIndex, lastLogTerm) {
    // RPC handler: should we vote for this candidate?
    if (!this.alive) return { voteGranted: false, term: -1 };

    // Rule 1: candidate's term must be >= ours
    if (candidateTerm < this.currentTerm)
      return { voteGranted: false, term: this.currentTerm };

    // Discovering a higher term: revert to follower
    if (candidateTerm > this.currentTerm) {
      this.currentTerm = candidateTerm;
      this.state = "follower";
      this.votedFor = null;
    }

    // Rule 2: we haven't already voted for someone else in this term
    const alreadyVoted = this.votedFor !== null && this.votedFor !== candidateId;
    if (alreadyVoted) return { voteGranted: false, term: this.currentTerm };

    // Rule 3: candidate's log must be at least as up-to-date as ours
    // "More up-to-date" = higher last log term, or same term + longer log
    const myLastLogTerm = this.log.length ? this.log.at(-1).term : 0;
    const myLastLogIndex = this.log.length;
    const logOk = lastLogTerm > myLastLogTerm ||
      (lastLogTerm === myLastLogTerm && lastLogIndex >= myLastLogIndex);
    if (!logOk) return { voteGranted: false, term: this.currentTerm };

    // Grant vote
    this.votedFor = candidateId;
    return { voteGranted: true, term: this.currentTerm };
  }

  startElection() {
    this.state = "candidate";
    this.currentTerm += 1;
    this.votedFor = this.id; // always vote for self
    let votesReceived = 1;   // count self-vote
    const majority = Math.floor((this.peers.length + 2) / 2); // e.g. 3 of 5
    const myLastLogTerm = this.log.length ? this.log.at(-1).term : 0;
    const myLastLogIndex = this.log.length;
    console.log(`Node ${this.id} starting election for term ${this.currentTerm}`);
    for (const peer of this.peers) {
      const resp = peer.requestVote(this.id, this.currentTerm, myLastLogIndex, myLastLogTerm);
      if (resp.voteGranted) {
        votesReceived++;
        console.log(`  Node ${peer.id} → VOTE GRANTED (total: ${votesReceived}/${this.peers.length + 1})`);
      } else {
        console.log(`  Node ${peer.id} → vote denied (term=${resp.term})`);
      }
    }
    if (votesReceived >= majority) {
      this.state = "leader";
      console.log(`Node ${this.id} WON election for term ${this.currentTerm} (${votesReceived}/${this.peers.length + 1} votes)`);
    } else {
      this.state = "follower";
      console.log(`Node ${this.id} lost election (${votesReceived} votes — not a majority)`);
    }
  }
}

// Build a 5-node cluster
const cluster = [1, 2, 3, 4, 5].map(id => new RaftNode(id, []));
cluster.forEach(n => { n.peers = cluster.filter(p => p.id !== n.id); });

console.log("=== Normal election (node 2 times out first) ===");
cluster[1].startElection(); // Node 2 wins

console.log("\n=== Leader crashes, node 4 times out next ===");
cluster[1].alive = false; // node 2 (leader) crashes
// Node 4's log is as up-to-date as everyone else's, so it can win
cluster[3].startElection();
```
Once a leader is elected, it handles all client requests. When a client sends a command, the leader: (1) appends the entry to its own log with the current term, (2) sends AppendEntries RPCs to all followers in parallel, (3) once a majority of nodes have written the entry, marks it as committed, (4) applies it to its state machine, (5) replies to the client, (6) notifies followers of the commit in subsequent RPCs.
```javascript
// Log replication: leader's AppendEntries RPC
// Includes the log matching check that keeps all followers consistent
class RaftLog {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.entries = [];      // [{term, command}] — 1-indexed in paper, 0-indexed here
    this.commitIndex = -1;  // highest log index known to be committed
    this.lastApplied = -1;  // highest log index applied to state machine
  }

  appendEntries({ leaderTerm, prevLogIndex, prevLogTerm, entries, leaderCommit }) {
    // The heart of log replication — called on each follower by the leader

    // Check 1: Log matching — do we agree on the entry before the new ones?
    if (prevLogIndex >= 0) {
      const ourEntry = this.entries[prevLogIndex];
      if (!ourEntry) {
        console.log(`  Node ${this.nodeId}: REJECT — missing entry at prevLogIndex=${prevLogIndex}`);
        return false;
      }
      if (ourEntry.term !== prevLogTerm) {
        // Conflicting entry: delete from here onwards (leader's log wins)
        console.log(`  Node ${this.nodeId}: CONFLICT at index ${prevLogIndex} — truncating log`);
        this.entries.splice(prevLogIndex);
        return false;
      }
    }

    // Check 2: Append new entries (skip any we already have)
    let insertAt = prevLogIndex + 1;
    for (const entry of entries) {
      if (this.entries[insertAt]?.term !== entry.term)
        this.entries[insertAt] = entry; // overwrite only if different
      insertAt++;
    }

    // Update commit index
    if (leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(leaderCommit, this.entries.length - 1);
      console.log(`  Node ${this.nodeId}: advancing commitIndex to ${this.commitIndex}`);
    }
    console.log(`  Node ${this.nodeId}: log length=${this.entries.length}, commit=${this.commitIndex}`);
    return true;
  }
}

// Simulate a leader replicating entries to 4 followers
const CLUSTER_SIZE = 5;
const MAJORITY = Math.floor(CLUSTER_SIZE / 2) + 1;
const followers = [2, 3, 4, 5].map(id => new RaftLog(id));
const leaderLog = { entries: [], commitIndex: -1, term: 1 };

function leaderAppend(command) {
  leaderLog.entries.push({ term: leaderLog.term, command });
  const newIndex = leaderLog.entries.length - 1;
  console.log(`Leader appends [${newIndex}] "${command}" in term ${leaderLog.term}`);
  // Send AppendEntries to each follower
  let acks = 1; // leader counts itself
  for (const follower of followers) {
    const ok = follower.appendEntries({
      leaderTerm: leaderLog.term,
      prevLogIndex: newIndex - 1,
      prevLogTerm: newIndex > 0 ? leaderLog.entries[newIndex - 1].term : -1,
      entries: [leaderLog.entries[newIndex]],
      leaderCommit: leaderLog.commitIndex,
    });
    if (ok) acks++;
  }
  // Commit requires a majority of the original 5-node cluster, even if one follower is down
  if (acks >= MAJORITY) {
    leaderLog.commitIndex = newIndex;
    console.log(`Committed at index ${newIndex} (${acks}/${CLUSTER_SIZE} acks ≥ majority ${MAJORITY})\n`);
  }
}

// Normal case: replicate a sequence of commands
console.log("=== Normal replication ===");
leaderAppend("set x=1");
leaderAppend("set y=2");

// One follower goes offline — still have majority (3 of 5)
console.log("=== One follower offline — still commits ===");
const deadFollower = followers[3]; // node 5 goes down
followers.splice(3, 1);
leaderAppend("set z=3"); // 1 leader + 3 followers = 4/5 acks ≥ 3
followers.push(deadFollower);
```
A different problem arises when a follower falls behind. Raft does not need to stop the cluster or copy the entire log from scratch. The leader keeps a nextIndex[follower] pointer saying “the next log entry I think this follower needs.” If an AppendEntries attempt is rejected because prevLogIndex/prevLogTerm does not match, the leader moves nextIndex backward and retries with an earlier prefix. Once it finds a matching point, it sends the missing suffix and the follower catches up.
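That backoff-and-retry loop can be sketched in a few lines. This is a simplified model with illustrative logs — a real leader runs this per follower against its nextIndex map rather than as a standalone loop:

```javascript
// Sketch of the leader's catch-up loop for a lagging, diverged follower.
// The leader's log (terms [1,1,3,3,3]) is authoritative; the follower
// ([1,1,2]) holds an uncommitted entry from a dead term-2 leader.
const leaderEntries = [
  { term: 1, cmd: "a" }, { term: 1, cmd: "b" },
  { term: 3, cmd: "c" }, { term: 3, cmd: "d" }, { term: 3, cmd: "e" },
];
const followerEntries = [
  { term: 1, cmd: "a" }, { term: 1, cmd: "b" }, { term: 2, cmd: "x" },
];

// Follower-side consistency check: accept only if the entry just before
// the new ones exists and carries the expected term
function followerAccepts(prevIndex, prevTerm) {
  if (prevIndex < 0) return true; // replaying from the very start
  const e = followerEntries[prevIndex];
  return e !== undefined && e.term === prevTerm;
}

let nextIndex = leaderEntries.length; // optimistic initial guess
while (nextIndex > 0) {
  const prevIndex = nextIndex - 1;
  if (followerAccepts(prevIndex, leaderEntries[prevIndex].term)) break;
  nextIndex--; // back off one entry and retry
  console.log(`Follower rejected — backing off to nextIndex=${nextIndex}`);
}

// Matching point found: overwrite from nextIndex with the leader's suffix
followerEntries.length = nextIndex;
followerEntries.push(...leaderEntries.slice(nextIndex));
console.log(`Follower caught up: terms = [${followerEntries.map(e => e.term)}]`);
// → Follower caught up: terms = [1,1,3,3,3]
```

Note that the follower's divergent term-2 entry is simply discarded — it was never committed, so overwriting it is safe.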
Raft's most critical property is the Leader Completeness property: if an entry is committed in some term, it will be present in the logs of all future leaders. This ensures the state machine never sees a committed entry go missing.
The mechanism is the vote restriction: a candidate can only receive a vote from a node if the candidate's log is at least as up-to-date as the voter's log. Since any committed entry was acknowledged by a majority, and any winning candidate must receive votes from a majority, those two majorities must overlap — so the winner must have seen the committed entry.
```javascript
// Safety: why a stale node can never become leader
// The vote restriction ensures committed entries are never lost
function isMoreUpToDate(candidateLog, voterLog) {
  // Raft's "at least as up-to-date" check (§5.4.1)
  // Compare last log term first; if equal, compare log length
  const candLastTerm = candidateLog.at(-1)?.term ?? 0;
  const voterLastTerm = voterLog.at(-1)?.term ?? 0;
  if (candLastTerm !== voterLastTerm) return candLastTerm > voterLastTerm;
  return candidateLog.length >= voterLog.length;
}

// Scenario: 5 nodes, leader crashes after committing entry [4]
// Three possible survivors try to become leader
const committedLog = [
  { index: 1, term: 1, cmd: "x=1" },
  { index: 2, term: 1, cmd: "y=2" },
  { index: 3, term: 2, cmd: "z=3" },
  { index: 4, term: 2, cmd: "w=4" }, // ← committed (majority acked)
];

// Node logs after leader crash (some had partial replication)
const nodeLogs = {
  2: [...committedLog],         // has all 4 entries (up-to-date)
  3: committedLog.slice(0, 3),  // missing entry 4 (stale!)
  4: committedLog.slice(0, 2),  // missing entries 3 & 4 (very stale!)
  5: [...committedLog],         // has all 4 entries (up-to-date)
};

console.log("After leader crash — node log lengths:");
Object.entries(nodeLogs).forEach(([id, log]) =>
  console.log(`  Node ${id}: ${log.length} entries, last term=${log.at(-1)?.term ?? 0}`)
);

// Simulate RequestVote — can node 3 (stale) become leader?
console.log("\nCan stale node 3 get votes from nodes 4 and 5?");
[4, 5].forEach(voterId => {
  const grantVote = isMoreUpToDate(nodeLogs[3], nodeLogs[voterId]);
  console.log(`  Node ${voterId} votes for node 3: ${grantVote ? "YES" : "NO — stale log rejected"}`);
});

// Can node 2 (up-to-date) win?
console.log("\nCan up-to-date node 2 get votes from nodes 3, 4, and 5?");
[3, 4, 5].forEach(voterId => {
  const grantVote = isMoreUpToDate(nodeLogs[2], nodeLogs[voterId]);
  console.log(`  Node ${voterId} votes for node 2: ${grantVote ? "YES" : "NO"}`);
});

console.log("\nKey insight:");
console.log("  Committed entry [4] was acked by nodes 1, 2, 5 (majority of 5)");
console.log("  Any winner must get votes from 3 of 5 nodes");
console.log("  Intersection of {1,2,5} and any majority of 5 is non-empty");
console.log("  → Winner always sees entry [4] → committed data is never lost ✓");
```
Raft is designed to remain correct under a wide range of failures: leader crashes, follower crashes, network partitions, and lost or delayed messages. The unifying rule is that the cluster makes progress as long as a majority of nodes can communicate.
Raft tolerates up to ⌊(n−1)/2⌋ simultaneous failures in an n-node cluster: 1 of 3, 2 of 5, 3 of 7. This is the best any crash-tolerant consensus protocol can do — tolerating f failures fundamentally requires at least 2f + 1 nodes, so this is a theoretical bound, not a Raft limitation. Paxos, Zab, and Multi-Paxos have the same bound.

Putting it all together: a structurally complete Raft implementation covering leader election, log replication, commit tracking, and basic crash/recovery scenarios. This is ~200 lines of JavaScript — the same architecture as production implementations in etcd and CockroachDB, just without the network layer.
```javascript
// Complete Raft node implementation
// Covers: leader election, log replication, commit index, crash recovery
class RaftNode {
  constructor(id) {
    this.id = id;
    this.peers = [];
    this.state = "follower";
    this.currentTerm = 0;
    this.votedFor = null;
    this.log = [];           // [{term, command}]
    this.commitIndex = -1;
    this.lastApplied = -1;
    this.nextIndex = {};     // leader: next log index to send to each peer
    this.matchIndex = {};    // leader: highest log index known replicated on each peer
    this.alive = true;
    this.stateMachine = {};  // applied commands go here
  }

  _upToDate(lastLogIndex, lastLogTerm) {
    const myTerm = this.log.at(-1)?.term ?? 0;
    const myLastIndex = this.log.length - 1;
    return lastLogTerm > myTerm || (lastLogTerm === myTerm && lastLogIndex >= myLastIndex);
  }

  requestVote(candidateId, term, lastLogIndex, lastLogTerm) {
    if (!this.alive) return { granted: false, term: 0 };
    if (term > this.currentTerm) {
      this.currentTerm = term;
      this.state = "follower";
      this.votedFor = null;
    }
    const canVote = term >= this.currentTerm &&
      (this.votedFor === null || this.votedFor === candidateId) &&
      this._upToDate(lastLogIndex, lastLogTerm);
    if (canVote) {
      this.votedFor = candidateId;
      return { granted: true, term };
    }
    return { granted: false, term: this.currentTerm };
  }

  appendEntries(leaderTerm, leaderId, prevLogIndex, prevLogTerm, entries, leaderCommit) {
    if (!this.alive || leaderTerm < this.currentTerm) return false;
    this.currentTerm = leaderTerm;
    this.state = "follower";
    // Log matching check
    if (prevLogIndex >= 0) {
      if (this.log.length <= prevLogIndex || this.log[prevLogIndex]?.term !== prevLogTerm)
        return false;
    }
    // Append new entries, truncating any conflicting suffix
    // (only truncate when entries are sent — an empty heartbeat must not shrink the log)
    if (entries.length > 0) {
      entries.forEach((e, i) => { this.log[prevLogIndex + 1 + i] = e; });
      this.log.length = prevLogIndex + 1 + entries.length;
    }
    if (leaderCommit > this.commitIndex) {
      this.commitIndex = Math.min(leaderCommit, this.log.length - 1);
      this._applyCommitted();
    }
    return true;
  }

  _applyCommitted() {
    while (this.lastApplied < this.commitIndex) {
      this.lastApplied++;
      const { key, value, op } = this.log[this.lastApplied].command;
      if (op === "set") this.stateMachine[key] = value;
      else if (op === "del") delete this.stateMachine[key];
    }
  }

  becomeLeader() {
    this.state = "leader";
    this.peers.forEach(p => {
      this.nextIndex[p.id] = this.log.length;
      this.matchIndex[p.id] = -1;
    });
    console.log(`Node ${this.id} became LEADER for term ${this.currentTerm}`);
  }

  startElection() {
    if (!this.alive) return;
    this.currentTerm++;
    this.state = "candidate";
    this.votedFor = this.id;
    let votes = 1;
    const lastIdx = this.log.length - 1;
    const lastTerm = this.log.at(-1)?.term ?? 0;
    for (const p of this.peers) {
      if (p.requestVote(this.id, this.currentTerm, lastIdx, lastTerm).granted) votes++;
    }
    if (votes > (this.peers.length + 1) / 2) this.becomeLeader();
    else this.state = "follower";
  }

  clientWrite(command) {
    // Client sends write to leader only
    if (!this.alive) throw new Error("Node is down");
    if (this.state !== "leader") throw new Error("Not the leader");
    const entry = { term: this.currentTerm, command };
    this.log.push(entry);
    const entryIndex = this.log.length - 1;
    let acks = 1; // leader counts itself
    for (const peer of this.peers) {
      const prevIdx = entryIndex - 1;
      const prevTerm = prevIdx >= 0 ? this.log[prevIdx].term : -1;
      if (peer.appendEntries(this.currentTerm, this.id, prevIdx, prevTerm, [entry], this.commitIndex))
        acks++;
    }
    if (acks > (this.peers.length + 1) / 2) {
      this.commitIndex = entryIndex;
      this._applyCommitted();
      // Notify followers of new commitIndex on next heartbeat
      for (const peer of this.peers)
        peer.appendEntries(this.currentTerm, this.id, entryIndex, entry.term, [], this.commitIndex);
      return true;
    }
    return false; // failed to commit (lost majority)
  }
}

// === Demo: build cluster, elect leader, write data, crash, elect new leader ===
const nodes = [1, 2, 3, 4, 5].map(id => new RaftNode(id));
nodes.forEach(n => { n.peers = nodes.filter(p => p.id !== n.id); });

console.log("=== Phase 1: Election ===");
nodes[0].startElection(); // node 1 times out first
const leader = nodes.find(n => n.state === "leader");

console.log("\n=== Phase 2: Normal writes ===");
leader.clientWrite({ op: "set", key: "x", value: 10 });
leader.clientWrite({ op: "set", key: "y", value: 20 });
leader.clientWrite({ op: "set", key: "z", value: 30 });
console.log(`All nodes' state machines after writes:`);
nodes.forEach(n => console.log(`  Node ${n.id}: ${JSON.stringify(n.stateMachine)}`));

console.log("\n=== Phase 3: Leader crashes ===");
leader.alive = false;
leader.state = "follower";
console.log(`Node ${leader.id} crashed`);
// The first surviving node times out and starts a new election
const survivor = nodes.find(n => n.alive && n.id !== leader.id);
survivor.startElection();
const newLeader = nodes.find(n => n.alive && n.state === "leader");
if (newLeader) {
  newLeader.clientWrite({ op: "set", key: "x", value: 99 });
  console.log(`After new write by node ${newLeader.id}:`);
  nodes.filter(n => n.alive).forEach(n =>
    console.log(`  Node ${n.id}: ${JSON.stringify(n.stateMachine)}`)
  );
}
```
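As a sanity check on the ⌊(n−1)/2⌋ fault-tolerance bound discussed above, the quorum arithmetic is two lines:

```javascript
// Majority quorum arithmetic: an n-node cluster commits with ⌊n/2⌋ + 1 acks,
// so it keeps working as long as no more than n − quorum = ⌊(n−1)/2⌋ nodes fail
for (const n of [3, 5, 7]) {
  const quorum = Math.floor(n / 2) + 1;
  const tolerated = n - quorum; // = ⌊(n−1)/2⌋
  console.log(`n=${n}: commit quorum=${quorum}, tolerates ${tolerated} failures`);
}
// → n=3: commit quorum=2, tolerates 1 failures
// → n=5: commit quorum=3, tolerates 2 failures
// → n=7: commit quorum=4, tolerates 3 failures
```

This is also why clusters use odd sizes: going from 5 to 6 nodes raises the quorum to 4 without tolerating any extra failures.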
Raft was explicitly designed as an alternative to Paxos that is easier to understand and implement. The original paper (Ongaro & Ousterhout, 2014) included a user study showing that students who learned Raft performed significantly better on quizzes than those who learned Paxos. Understandability was a first-class design goal.
Raft is now arguably the dominant consensus algorithm in new distributed systems work. Paxos dominates in legacy Google infrastructure (Chubby, Spanner's underlying protocol); Raft dominates in newer open-source infrastructure.
The most important insight from real deployments: the bottleneck is almost never the consensus protocol itself. Leader election takes milliseconds. Log replication overhead is a few percent. The hard problems in production Raft deployments are operational: detecting slow followers, managing log compaction (snapshotting), handling asymmetric network partitions, and tuning election timeouts for your latency environment.
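Of those operational concerns, log compaction is the most mechanical to illustrate. The Raft paper (§7) describes the real mechanism, including an InstallSnapshot RPC for followers that fall behind the compacted prefix; the function and field names below are illustrative, not from any particular implementation. The idea: once entries are applied, the state machine itself summarises them, so the applied prefix of the log can be replaced by a snapshot.

```javascript
// Illustrative snapshot-based log compaction: replace the applied prefix of
// the log with a snapshot of the state machine at that point
function takeSnapshot(node) {
  node.snapshot = {
    lastIncludedIndex: node.lastApplied,               // where the snapshot ends
    lastIncludedTerm: node.log[node.lastApplied].term, // needed for log matching
    state: JSON.parse(JSON.stringify(node.stateMachine)), // deep copy of state
  };
  node.log = node.log.slice(node.lastApplied + 1); // discard compacted prefix
}

const node = {
  log: [
    { term: 1, command: { op: "set", key: "x", value: 1 } },
    { term: 1, command: { op: "set", key: "x", value: 2 } },
    { term: 2, command: { op: "set", key: "y", value: 3 } },
    { term: 2, command: { op: "set", key: "z", value: 4 } }, // not yet applied
  ],
  lastApplied: 2, // entries 0..2 applied (0-indexed)
  stateMachine: { x: 2, y: 3 },
  snapshot: null,
};

takeSnapshot(node);
console.log(`Snapshot through index ${node.snapshot.lastIncludedIndex}, term ${node.snapshot.lastIncludedTerm}`);
console.log(`State captured: ${JSON.stringify(node.snapshot.state)}`);
console.log(`Log shrank from 4 entries to ${node.log.length}`);
```

Real implementations additionally keep lastIncludedIndex as a base offset so that log indices stay stable after truncation, and write the snapshot to disk before discarding the prefix.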