db.transactions
All code runs in your browser · No libraries
Interactive explainer · Database internals

Transaction Control
from first principles

How does a database stay correct when a thousand transactions run at once — and a server crashes mid-write? Five mechanisms, one coherent story: isolation, durability, and distributed atomicity.

Isolation
2PL
pessimistic locking
Isolation
OCC
optimistic, validate on commit
Isolation
MVCC
snapshot reads, no read locks
Durability
WAL
write-ahead logging
Distributed
2PC / 3PC
atomic commit across nodes
ACID = Atomicity · Consistency · Isolation · Durability
§01

The problem: concurrency anomalies

Without any concurrency control, concurrent transactions produce incorrect results. The four anomalies below are the classic starting point, but they are not the whole story. Real isolation levels are ultimately judged by whether they permit serialization anomalies such as write skew, not just by this short list.

Anomaly explorer — click each to see what goes wrong
anomalies.js — simulating all four concurrency anomalies
// Simulate concurrency anomalies without any concurrency control
// Each function shows what goes wrong when two transactions interleave

function dirtyRead() {
  // T2 reads a value written by T1 before T1 commits
  // If T1 aborts, T2 has seen data that never existed
  let balance = 1000;
  const T1_write = () => { balance = 1500; }; // T1: transfer in (not yet committed)
  const T2_read  = () => balance;              // T2: reads T1's uncommitted write
  const T1_abort = () => { balance = 1000; }; // T1: rolls back
  T1_write();
  const seen = T2_read();  // T2 sees 1500
  T1_abort();
  console.log(`Dirty read: T2 saw ${seen}, actual=${balance} ← T2 acted on money that never existed!`);
}

function nonRepeatableRead() {
  // T1 reads the same row twice; T2 modifies it between the reads
  let price = 100;
  const read1 = price;          // T1 first read
  price = 120;                  // T2 updates and commits
  const read2 = price;          // T1 second read — different result!
  console.log(`Non-repeatable read: T1 read price as ${read1} then ${read2} in one transaction`);
}

function phantomRead() {
  // T1 runs a query twice; T2 inserts rows matching the predicate between them
  let orders = [{ id:1, amount:50 }, { id:2, amount:80 }];
  const query = () => orders.filter(o => o.amount > 40);
  const scan1 = query();         // T1 first scan: 2 rows
  orders.push({ id:3, amount:200 }); // T2 inserts a new large order
  const scan2 = query();         // T1 second scan: 3 rows — phantom appeared!
  console.log(`Phantom read: first scan=${scan1.length} rows, second scan=${scan2.length} rows`);
}

function lostUpdate() {
  // Both T1 and T2 read-then-write; one write overwrites the other
  let views = 1000;
  const v1 = views;  // T1 reads 1000
  const v2 = views;  // T2 reads 1000
  views = v1 + 1;   // T1 writes 1001
  views = v2 + 1;   // T2 writes 1001 — T1's increment is LOST
  console.log(`Lost update: after 2 increments, views=${views} (should be 1002)`);
}

dirtyRead();
nonRepeatableRead();
phantomRead();
lostUpdate();

// SQL isolation levels and what they prevent:
const levels = [
  { level: "READ UNCOMMITTED", dirty: "✗", nonrep: "✗", phantom: "✗", lost: "✗" },
  { level: "READ COMMITTED",   dirty: "✓", nonrep: "✗", phantom: "✗", lost: "✗" },
  { level: "REPEATABLE READ",   dirty: "✓", nonrep: "✓", phantom: "✗", lost: "✓" },
  { level: "SERIALIZABLE",      dirty: "✓", nonrep: "✓", phantom: "✓", lost: "✓" },
];
console.log("\nIsolation levels (✓=prevented, ✗=possible):");
levels.forEach(l =>
  console.log(`  ${l.level.padEnd(20)} dirty=${l.dirty} nonrep=${l.nonrep} phantom=${l.phantom} lost=${l.lost}`));
§02

Two-Phase Locking (2PL)

2PL is the classic pessimistic approach. Every transaction must acquire a shared lock (S) before reading a resource and an exclusive lock (X) before writing. The "two phases" are: a growing phase where locks are only acquired, and a shrinking phase where locks are only released. Once you release any lock, you can't acquire new ones.

This strict ordering guarantees serializability — the execution is equivalent to some serial order. The cost: deadlocks. If T1 holds X on A and waits for B, while T2 holds X on B and waits for A, neither can proceed. The database must detect and break the cycle by aborting one transaction.
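The two-phase rule itself is mechanical enough to check. As a quick sanity check (a hypothetical helper, not part of the lock manager below): a transaction's lock history obeys 2PL exactly when no acquire appears after the first release.

```javascript
// Hypothetical sketch: does a transaction's lock history obey the two-phase rule?
// All acquires must come before any release.
function obeysTwoPhaseRule(ops) { // ops: array of 'acquire' | 'release'
  const firstRelease = ops.indexOf('release');
  return firstRelease === -1 || !ops.slice(firstRelease).includes('acquire');
}

console.log(obeysTwoPhaseRule(['acquire', 'acquire', 'release', 'release'])); // true
console.log(obeysTwoPhaseRule(['acquire', 'release', 'acquire']));            // false: reacquired after shrinking began
```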

2PL lock manager — step through a scenario
two_phase_locking.js — lock manager with deadlock detection
// Two-Phase Locking: acquire locks (growing), then release (shrinking)
// Strict 2PL: hold ALL locks until commit/abort (used by most real DBs)

class LockManager {
  constructor() {
    this.locks   = new Map(); // resource → { mode, holders: Set, waiters: [] }
    this.txLocks = new Map(); // txId → Set of resources held
    this.waitsFor= new Map(); // txId → txId it is waiting for
  }

  _getLock(resource) {
    if (!this.locks.has(resource))
      this.locks.set(resource, { mode: null, holders: new Set(), waiters: [] });
    return this.locks.get(resource);
  }

  acquire(txId, resource, mode) {
    const lock = this._getLock(resource);
    const compatible = (existing, requested) =>
      existing === null || (existing === 'S' && requested === 'S');

    if (compatible(lock.mode, mode) || lock.holders.has(txId)) {
      lock.holders.add(txId);
      lock.mode = lock.holders.size > 1 ? 'S' : mode;
      if (!this.txLocks.has(txId)) this.txLocks.set(txId, new Set());
      this.txLocks.get(txId).add(resource);
      console.log(`  T${txId} acquired ${mode}-lock on ${resource}`);
      return true;
    }
    // Must wait — record waits-for for deadlock detection
    lock.waiters.push({ txId, mode });
    this.waitsFor.set(txId, [...lock.holders][0]);
    console.log(`  T${txId} WAITING for ${mode}-lock on ${resource} (held by T${[...lock.holders][0]})`);
    return false;
  }

  release(txId) {
    const held = this.txLocks.get(txId) || new Set();
    for (const res of held) {
      const lock = this.locks.get(res);
      lock.holders.delete(txId);
      if (lock.holders.size === 0) {
        lock.mode = null;
        const next = lock.waiters.shift();
        if (next) {
          lock.holders.add(next.txId); lock.mode = next.mode;
          if (!this.txLocks.has(next.txId)) this.txLocks.set(next.txId, new Set());
          this.txLocks.get(next.txId).add(res); // track it so a later release() frees it
          this.waitsFor.delete(next.txId);
          console.log(`  T${next.txId} unblocked — acquired ${next.mode}-lock on ${res}`);
        }
      }
    }
    this.txLocks.delete(txId);
    console.log(`  T${txId} released all locks`);
  }

  detectDeadlock() {
    // DFS on waits-for graph looking for a cycle
    const visited = new Set(), stack = new Set();
    const hasCycle = (node) => {
      if (stack.has(node)) return true;
      if (visited.has(node)) return false;
      visited.add(node); stack.add(node);
      const next = this.waitsFor.get(node);
      if (next && hasCycle(next)) return true;
      stack.delete(node); return false;
    };
    for (const tx of this.waitsFor.keys())
      if (hasCycle(tx)) return true;
    return false;
  }
}

const lm = new LockManager();
console.log("=== Normal 2PL scenario ===");
lm.acquire(1, "account_A", "S"); // T1 reads A
lm.acquire(2, "account_A", "S"); // T2 reads A — compatible, both get S-lock
lm.acquire(1, "account_B", "X"); // T1 writes B — no one holds B, granted
lm.acquire(2, "account_B", "X"); // T2 wants X on B — must wait for T1
console.log(`Deadlock detected: ${lm.detectDeadlock()}`);
lm.release(1);                      // T1 commits, releases all locks

console.log("\n=== Deadlock scenario ===");
const lm2 = new LockManager();
lm2.acquire(1, "A", "X"); // T1 holds A
lm2.acquire(2, "B", "X"); // T2 holds B
lm2.acquire(1, "B", "X"); // T1 waits for B (held by T2)
lm2.acquire(2, "A", "X"); // T2 waits for A (held by T1) → CYCLE
console.log(`Deadlock detected: ${lm2.detectDeadlock()} → abort one transaction`);
§03

Optimistic Concurrency Control (OCC)

OCC assumes conflicts are rare. Instead of blocking on locks, every transaction executes freely — reads and writes go to a private workspace. At commit time, a validation phase checks whether any conflicts actually occurred. If not, the writes are installed. If yes, the transaction is aborted and retried.

OCC has three phases: Read (execute freely, track read-set and write-set), Validate (check for conflicts with recently committed transactions), Write (install changes atomically if validation passes). A correct validator must reject both read/write conflicts and write/write conflicts that would otherwise collapse into last-writer-wins.
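The "aborted and retried" part matters in practice: OCC application code is always written as a retry loop. A minimal sketch, where `attempt` is a stand-in for "run the transaction body against fresh state and try to commit":

```javascript
// Sketch of the standard OCC retry loop. `attempt` is a placeholder that
// returns true if validation passed, false if the transaction was aborted.
function runWithRetries(attempt, maxRetries = 5) {
  for (let i = 0; i < maxRetries; i++) {
    if (attempt()) return true; // validation passed: writes are installed
    // validation failed: a conflicting txn committed first; re-run from scratch
  }
  return false; // persistent contention: caller must give up or fall back
}

// Toy usage: the attempt "fails" twice (simulated conflicts), then succeeds.
let attempts = 0;
const ok = runWithRetries(() => ++attempts >= 3);
console.log(ok, attempts); // true 3
```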

OCC validation — run transactions and see which pass
occ.js — optimistic concurrency control with validation
// Optimistic Concurrency Control — three phases: Read, Validate, Write
// No locks during execution. Conflicts detected only at commit time.

class OCCManager {
  constructor(db) {
    this.db        = { ...db };    // shared database state
    this.committed = [];           // list of recently committed txns {writeSet, endTime}
    this.clock     = 0;
  }

  begin() {
    return {
      id: ++this.clock,
      startTime: this.clock,
      readSet: new Set(),
      writeSet: new Map(),
      workspace: { ...this.db }, // private snapshot of the database
    };
  }

  read(tx, key) {
    tx.readSet.add(key);
    return tx.workspace[key]; // reads from private workspace snapshot
  }

  write(tx, key, value) {
    tx.writeSet.set(key, value);
    tx.workspace[key] = value; // buffered in private workspace, not committed yet
  }

  validate(tx) {
    tx.validationTime = ++this.clock;
    // For every committed txn whose interval overlaps ours:
    // its write-set must not intersect our read-set or write-set
    for (const committed of this.committed) {
      if (committed.endTime <= tx.startTime) continue; // committed before we started
      for (const key of committed.writeSet.keys()) {
        if (tx.readSet.has(key)) {
          console.log(`  T${tx.id} CONFLICT: committed T${committed.id} wrote '${key}' which T${tx.id} read → ABORT`);
          return false;
        }
        if (tx.writeSet.has(key)) {
          console.log(`  T${tx.id} CONFLICT: committed T${committed.id} also wrote '${key}' → ABORT`);
          return false;
        }
      }
    }
    return true;
  }

  commit(tx) {
    if (!this.validate(tx)) return false;
    // Write phase: install all buffered writes atomically
    for (const [key, val] of tx.writeSet)
      this.db[key] = val;
    tx.endTime = ++this.clock;
    this.committed.push(tx);
    console.log(`  T${tx.id} COMMITTED. DB: ${JSON.stringify(this.db)}`);
    return true;
  }
}

const occ = new OCCManager({ balance_A: 1000, balance_B: 500 });

console.log("=== Scenario 1: No conflict (different keys) ===");
const t1 = occ.begin();
const t2 = occ.begin();
occ.read(t1, "balance_A");  occ.write(t1, "balance_A", 900);
occ.read(t2, "balance_B");  occ.write(t2, "balance_B", 600);
occ.commit(t1); // T1 commits first
occ.commit(t2); // T2 commits — no overlap in read/write sets

console.log("\n=== Scenario 2: Write-Read conflict ===");
const occ2 = new OCCManager({ stock: 10 });
const ta = occ2.begin();
const tb = occ2.begin();
occ2.read(ta, "stock");   occ2.write(ta, "stock", 9);   // buy 1 unit
occ2.read(tb, "stock");   occ2.write(tb, "stock", 9);   // also buy 1 unit
occ2.commit(ta); // ta commits: stock = 9
occ2.commit(tb); // tb: ta wrote 'stock' which tb read → ABORT, retry!
§04

Multi-Version Concurrency Control (MVCC)

MVCC is the approach used by PostgreSQL, MySQL InnoDB, Oracle, and most modern databases. Instead of locking rows during reads, each transaction gets a consistent snapshot of the database as it existed at the transaction's start time. Writes create new versions; readers see the version that was current when their transaction began, plus their own writes.

The easiest way to picture this is as a version chain per row. A row is not overwritten in place. When a transaction updates price, the database keeps the old version and appends a new one. A reader walks backward through that chain and asks: “Which version was alive at the moment my snapshot was taken?” The first visible one is the value that transaction reads.

That is what the metadata means in the figure below. xmin is the transaction ID that created this version. xmax is the transaction ID that later replaced or deleted it. So a version is visible to a snapshot if it was born early enough (xmin is old enough) and not yet dead at that snapshot point (xmax is still in the future, or if it is the current version).
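Under a simplified model where transaction IDs double as commit timestamps and every creator and deleter has already committed, that visibility test reduces to two comparisons (a sketch; the fuller code below also has to handle in-progress transactions):

```javascript
// Simplified visibility test: txIds double as commit timestamps and every
// writer is assumed committed. The full version must also track active txns.
function visibleAt(version, snapshotTs) {
  const bornInTime = version.xmin <= snapshotTs; // created before the snapshot
  const notYetDead = version.xmax >  snapshotTs; // deleted after it (or never)
  return bornInTime && notYetDead;
}

const chain = [
  { value: 100, xmin: 0, xmax: 2 },        // old version, superseded by tx 2
  { value: 120, xmin: 2, xmax: Infinity }, // current version
];
console.log(chain.find(v => visibleAt(v, 1)).value); // 100: snapshot predates tx 2
console.log(chain.find(v => visibleAt(v, 3)).value); // 120: snapshot sees tx 2
```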

The result: readers never block writers, and writers never block readers. A long-running read-only query never holds up a write transaction. The cost is version storage — old versions must eventually be garbage-collected (PostgreSQL calls this VACUUM). Snapshot-based MVCC is great for read concurrency, but it is not automatically serializable; write skew is the classic counterexample.

Snapshot isolation prevents many familiar anomalies, but two transactions can still each read an old consistent snapshot, make different writes, and jointly violate a business rule. That is why systems like PostgreSQL still distinguish REPEATABLE READ from true SERIALIZABLE.
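Write skew in miniature: two transactions with disjoint write sets, each reading a stale but internally consistent snapshot. This is a toy sketch of the textbook on-call example, not any specific engine's behavior.

```javascript
// Toy write-skew sketch: the rule is "at least one doctor stays on call".
// Each txn checks the rule against its own snapshot, then writes a DIFFERENT
// key, so no write-write conflict check ever fires.
const committedState = { alice: "on-call", bob: "on-call" };
const onCall = s => Object.values(s).filter(v => v === "on-call").length;

const snapT1 = { ...committedState }; // T1's snapshot
const snapT2 = { ...committedState }; // T2's snapshot (both started together)

if (onCall(snapT1) >= 2) committedState.alice = "off"; // T1: "Bob's still on call"
if (onCall(snapT2) >= 2) committedState.bob   = "off"; // T2: "Alice's still on call"

console.log(onCall(committedState)); // 0: invariant broken, yet neither txn wrote a key the other wrote
```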
MVCC version chain — click a transaction to see what it reads
mvcc.js — snapshot isolation with a version chain
// MVCC: every write creates a new version; reads see the snapshot at tx start
// plus the transaction's own writes. This is snapshot isolation, not full serializability.
// Used by: PostgreSQL, MySQL InnoDB, Oracle, CockroachDB (with engine-specific details)

class MVCCDatabase {
  constructor() {
    this.versions = new Map(); // key → [{value, xmin, xmax}] (version chain)
    this.txCounter = 0;
    this.activeIds  = new Set();
  }

  begin() {
    const txId = ++this.txCounter;
    this.activeIds.add(txId);
    // Snapshot = current committed txId + set of in-progress txns at begin time
    return { txId, snapshotMin: txId, activeTxns: new Set(this.activeIds) };
  }

  write(key, value, txId) {
    if (!this.versions.has(key)) this.versions.set(key, []);
    const chain = this.versions.get(key);
    // Mark previous head as superseded (set xmax). A real engine exposes this
    // only once the writer commits; here the visibility rules in read() handle
    // in-progress writers.
    if (chain.length) chain[chain.length-1].xmax = txId;
    chain.push({ value, xmin: txId, xmax: Infinity });
    console.log(`  T${txId} wrote '${key}' = ${JSON.stringify(value)} (version xmin=${txId})`);
  }

  read(key, snapshot) {
    const chain = this.versions.get(key) || [];
    // Find newest version visible to this snapshot
    for (let i = chain.length - 1; i >= 0; i--) {
      const v = chain[i];
      const selfVisible = v.xmin === snapshot.txId;
      const creatorCommitted = v.xmin < snapshot.snapshotMin &&
                               !snapshot.activeTxns.has(v.xmin);
      const notYetDeleted = v.xmax > snapshot.snapshotMin ||
                             snapshot.activeTxns.has(v.xmax);
      if ((selfVisible || creatorCommitted) && notYetDeleted) return v.value;
    }
    return undefined; // no visible version
  }

  commit(tx) {
    this.activeIds.delete(tx.txId);
    console.log(`  T${tx.txId} committed`);
  }

  vacuum(oldestActiveTx) {
    // Remove versions invisible to all active transactions (PostgreSQL VACUUM)
    let removed = 0;
    for (const [key, chain] of this.versions)
      this.versions.set(key, chain.filter(v => {
        const dead = v.xmax <= oldestActiveTx; if (dead) removed++;
        return !dead;
      }));
    console.log(`  VACUUM removed ${removed} dead version(s)`);
  }
}

const db = new MVCCDatabase();
// Seed initial data (committed before any tx)
db.versions.set("price", [{ value: 100, xmin: 0, xmax: Infinity }]);

console.log("=== MVCC snapshot isolation ===");
const longReader = db.begin();          // T1: starts a long read-only report
const writer     = db.begin();          // T2: will update the price

db.write("price", 120, writer.txId);    // T2 writes new version
db.commit(writer);                      // T2 commits

// T1 (longReader) still sees the OLD version because it started before T2
const seen = db.read("price", longReader);
console.log(`  T1 (longReader) sees price = ${seen} (started before write, sees old version)`);

const freshReader = db.begin();
const seen2 = db.read("price", freshReader);
console.log(`  T3 (freshReader) sees price = ${seen2} (started after write, sees new version)`);

db.commit(longReader);
db.vacuum(3); // clean up versions invisible to all active txns
§05

Write-Ahead Logging (WAL)

WAL (also called the "redo log" in Oracle/MySQL) solves a different problem: durability. If a server crashes after a transaction commits but before the actual data pages have been flushed to disk, that data is lost. WAL prevents this with one rule:

Before any data page is written to disk, its log record must be written to disk first. On crash, replaying the log recovers any committed transaction that hadn't yet reached the data pages.

Every change is first appended to a sequential log file (fast sequential I/O), with each record assigned a monotonically increasing Log Sequence Number (LSN). At commit, only the log needs to be flushed — the data pages can be written lazily in the background. Recovery logic varies by engine: ARIES is one important design (Analysis → Redo → Undo), while other systems describe crash recovery more simply as replaying WAL from the last checkpoint and consulting transaction status.
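The rule in the box above can be stated as a single comparison, using the same LSN terminology as the code below (a sketch of a buffer manager's flush-time check, not any particular engine's API):

```javascript
// Sketch of the WAL rule as a flush-time check: a dirty data page may be
// written to disk only if the log is already durable past the page's last LSN.
function canFlushPage(pageLSN, flushedLSN) {
  return flushedLSN >= pageLSN; // the log covers every change on this page
}

console.log(canFlushPage(7, 9));  // true: safe to write the page
console.log(canFlushPage(12, 9)); // false: flush the log up to LSN 12 first
```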

WAL log — simulate writes, crash, and recovery
wal.js — write-ahead log with crash recovery (ARIES-style)
// Write-Ahead Logging — durability through sequential log writes
// ARIES recovery: Analysis → Redo → Undo

class WALEngine {
  constructor() {
    this.log          = [];     // the WAL — sequential, append-only
    this.dataPages     = {};    // in-memory "disk" (may lag behind log)
    this.activeTxns    = {};    // txId → { writes: [] }
    this.lsn           = 0;    // Log Sequence Number — ever-increasing
    this.flushedLSN    = 0;    // last LSN synced to durable storage
    this.lastCheckpoint= 0;
  }

  _appendLog(type, txId, data = {}) {
    const entry = { lsn: ++this.lsn, type, txId, timestamp: Date.now(), ...data };
    this.log.push(entry);
    return entry.lsn;
  }

  begin(txId) {
    this.activeTxns[txId] = { writes: [], prevLSN: 0 };
    this._appendLog("BEGIN", txId);
    console.log(`[LSN ${this.lsn}] BEGIN T${txId}`);
  }

  write(txId, key, oldVal, newVal) {
    // Write to log FIRST (before touching data pages)
    const lsn = this._appendLog("UPDATE", txId, { key, oldVal, newVal });
    this.activeTxns[txId].writes.push({ key, oldVal, lsn });
    // Data page write can be delayed (lazy/background)
    this.dataPages[key] = newVal;  // in practice: buffered, flushed later
    console.log(`[LSN ${lsn}] UPDATE T${txId}: ${key} ${oldVal}→${newVal}`);
  }

  commit(txId) {
    const lsn = this._appendLog("COMMIT", txId);
    // Force-flush log up to this LSN before returning to client
    this.flushedLSN = lsn;
    delete this.activeTxns[txId];
    console.log(`[LSN ${lsn}] COMMIT T${txId} — log flushed to LSN ${lsn}`);
  }

  checkpoint() {
    // Fuzzy checkpoint: log active txns, current LSN — don't stop the world
    const active = Object.keys(this.activeTxns);
    this.lastCheckpoint = this.lsn;
    this._appendLog("CHECKPOINT", null, { activeTxns: active });
    console.log(`[LSN ${this.lsn}] CHECKPOINT — active txns: [${active.map(t=>'T'+t).join(',')||'none'}]`);
  }

  recover() {
    console.log("\n=== ARIES RECOVERY ===");
    // Phase 1: Analysis — find last checkpoint; seed losers from the txns it
    // recorded as active, then refine from log records written after it
    const cpEntry = [...this.log].reverse().find(e => e.type === "CHECKPOINT");
    const winner = new Set();  // committed txns to redo
    const loser  = new Set(cpEntry ? cpEntry.activeTxns.map(Number) : []);
    const startLSN = cpEntry ? cpEntry.lsn : 0;
    for (const e of this.log.filter(e => e.lsn > startLSN)) {
      if (e.type === "BEGIN")    loser.add(e.txId);
      if (e.type === "COMMIT")  { loser.delete(e.txId); winner.add(e.txId); }
      if (e.type === "ABORT")   loser.delete(e.txId);
    }
    console.log(`Analysis: REDO txns=[${[...winner].map(t=>'T'+t)}], UNDO txns=[${[...loser].map(t=>'T'+t)}]`);

    // Phase 2: Redo — repeat history: replay EVERY update, winners and losers
    const db = {};
    for (const e of this.log) {
      if (e.type === "UPDATE") db[e.key] = e.newVal;
    }
    console.log("After Redo, DB:", JSON.stringify(db));

    // Phase 3: Undo — roll back loser writes in reverse order (ensures atomicity)
    for (const e of [...this.log].reverse()) {
      if (e.type === "UPDATE" && loser.has(e.txId)) db[e.key] = e.oldVal;
    }
    console.log("After Undo, DB:", JSON.stringify(db));
    return db;
  }
}

const wal = new WALEngine();
wal.begin(1);
wal.begin(2);
wal.write(1, "balance_A", 1000, 900);
wal.write(2, "balance_B", 500,  600);
wal.commit(1);   // T1 commits — log flushed, durable
wal.checkpoint(); // T2 still active
wal.write(2, "balance_C", 200, 300);
// --- CRASH HERE --- T2 was still active ---
console.log("\n*** SERVER CRASHES (T2 never committed) ***");
const recovered = wal.recover();
// T1 is redone (committed), T2 is undone (was in-flight)
§06

Two-Phase Commit (2PC)

So far, everything has been single-node. In a distributed system (sharded databases, microservices writing to multiple stores), you need a way to commit a transaction across multiple nodes atomically — either all commit or all abort. 2PC achieves this with a coordinator and one or more participants.

Phase 1 (Prepare): coordinator sends "prepare to commit" to all participants. Each participant durably writes its work to its WAL and replies "vote YES" (I'm ready) or "vote NO" (I can't). Phase 2 (Commit/Abort): if all voted YES, coordinator sends "COMMIT" to all. If any voted NO, it sends "ABORT" to all.

The fatal flaw of 2PC: if the coordinator crashes after participants vote YES but before the coordinator's COMMIT reaches them, participants are stuck in the "prepared" state — holding locks, unable to proceed, waiting forever for a decision that may never come. This is called the blocking problem.
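What can a stuck participant actually do? The standard cooperative answer is a termination protocol: poll the reachable peers and infer the decision where possible. A sketch (state names match the Participant class below):

```javascript
// Sketch of the cooperative termination protocol a prepared participant can
// run when the coordinator disappears. Peer states mirror the classes below.
function resolveWithoutCoordinator(peerStates) {
  if (peerStates.includes("committed")) return "commit"; // decision was COMMIT
  if (peerStates.includes("aborted"))   return "abort";  // decision was ABORT
  if (peerStates.includes("idle"))      return "abort";  // a peer never voted YES, so COMMIT was impossible
  return "block"; // every reachable peer is prepared: the decision is unknowable
}

console.log(resolveWithoutCoordinator(["committed", "prepared"])); // "commit"
console.log(resolveWithoutCoordinator(["prepared", "prepared"]));  // "block": 2PC's fatal case
```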
2PC step-by-step — choose a scenario
two_phase_commit.js — 2PC with coordinator failure simulation
// Two-Phase Commit — atomic commit across multiple nodes
// Shows both the happy path and the coordinator-failure blocking scenario

class Participant {
  constructor(name, willVoteYes = true) {
    this.name     = name;
    this.state    = "idle";
    this.prepared = null; // written to WAL when prepared
    this.willVote = willVoteYes;
  }

  prepare(txId, writes) {
    if (!this.willVote) {
      console.log(`  ${this.name}: VOTE NO (constraint violation)`);
      this.state = "aborted";
      return false;
    }
    // Force-write prepare record to WAL — cannot undo this vote
    this.prepared = { txId, writes };
    this.state    = "prepared";
    console.log(`  ${this.name}: VOTE YES (wrote prepare record to WAL)`);
    return true;
  }

  commit() {
    if (this.state !== "prepared") throw new Error("Not prepared!");
    this.state = "committed";
    console.log(`  ${this.name}: COMMITTED`);
  }

  abort() {
    this.state = "aborted";
    console.log(`  ${this.name}: ABORTED`);
  }
}

class Coordinator {
  constructor(participants) { this.participants = participants; }

  commit(txId, writesPerParticipant) {
    console.log(`[Phase 1 — PREPARE] TxId=${txId}`);
    const votes = this.participants.map((p, i) =>
      p.prepare(txId, writesPerParticipant[i] || []));
    const decision = votes.every(v => v) ? "COMMIT" : "ABORT";
    console.log(`[Coordinator decision: ${decision}]`);
    console.log(`[Phase 2 — ${decision}]`);
    // Coordinator writes decision to its OWN WAL before sending
    this.participants.forEach(p => decision === "COMMIT" ? p.commit() : p.abort());
    return decision;
  }
}

console.log("=== Happy path: all vote YES ===");
const p1 = [new Participant("node-A"), new Participant("node-B"), new Participant("node-C")];
const coord1 = new Coordinator(p1);
coord1.commit("tx-001", [["w(A,1)"],["w(B,2)"],["w(C,3)"]]);

console.log("\n=== One participant votes NO ===");
const p2 = [new Participant("node-A"), new Participant("node-B", false), new Participant("node-C")];
const coord2 = new Coordinator(p2);
coord2.commit("tx-002", [[],[],[]]);

console.log("\n=== Blocking scenario: coordinator crashes after phase 1 ===");
const p3 = [new Participant("node-A"), new Participant("node-B")];
console.log("Phase 1: all vote YES...");
p3.forEach(p => p.prepare("tx-003", []));
console.log("*** COORDINATOR CRASHES before sending Phase 2 ***");
console.log("node-A state:", p3[0].state, "← BLOCKED: holds locks, cannot commit or abort");
console.log("node-B state:", p3[1].state, "← BLOCKED: holds locks, cannot commit or abort");
console.log("Resolution: wait for coordinator recovery, or contact other participants");
§07

Three-Phase Commit (3PC)

3PC adds a middle phase between "prepared" and "committed" to eliminate the blocking problem. The insight: 2PC blocks because participants can't tell whether the coordinator crashed before or after deciding. 3PC makes the decision observable by adding a pre-commit phase.

Phase 1 (CanCommit): coordinator asks "can you commit?". Phase 2 (PreCommit): if all say yes, coordinator sends "prepare to commit" — this is new. Now every participant knows all others have voted YES. Phase 3 (DoCommit): coordinator sends actual commit. If the coordinator crashes during Phase 3, any participant can take over — they know every participant is in pre-committed state, so committing is safe.

3PC solves the blocking problem but assumes a synchronous network with bounded delays. In practice, network partitions make this assumption invalid — a partitioned participant can't distinguish "coordinator crashed" from "message delayed". This is why 3PC is rarely used in real systems. Modern systems use Paxos or Raft instead.
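A concrete failure sketch, applying the same recovery rules as recoverWithPeers below independently on each side of a partition (hypothetical states, not a full simulation):

```javascript
// Sketch: why 3PC breaks under a partition. The preCommit reached node-A but
// the message to node-B was delayed; then the network splits {A} | {B} and the
// coordinator becomes unreachable. Each side recovers alone with no visible peers.
function recoverAlone(state, visiblePeerStates) {
  if (state === "pre-committed")
    return visiblePeerStates.includes("aborted") ? "aborted" : "committed";
  if (state === "prepared")
    return visiblePeerStates.some(s => s === "pre-committed" || s === "committed")
      ? "committed" : "aborted";
  return state;
}

console.log(recoverAlone("pre-committed", [])); // "committed": A's side commits
console.log(recoverAlone("prepared", []));      // "aborted": B's side aborts, so the sides diverge
```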
2PC vs 3PC — which states can a participant be in on coordinator failure?
three_phase_commit.js — 3PC with coordinator failure recovery
// Three-Phase Commit — non-blocking (under synchronous networks)
// Adds a PRE-COMMIT phase so all participants know everyone voted YES

class ThreePCParticipant {
  constructor(name, willVote = true) {
    this.name     = name;
    this.state    = "idle";  // idle → prepared → pre-committed → committed/aborted
    this.willVote = willVote;
  }

  canCommit() {
    if (!this.willVote) { console.log(`  ${this.name}: NO`); return false; }
    console.log(`  ${this.name}: YES to canCommit`);
    this.state = "prepared";
    return true;
  }

  preCommit() {
    // Receiving preCommit means ALL others have voted YES
    // Now I can commit unilaterally if coordinator disappears
    this.state = "pre-committed";
    console.log(`  ${this.name}: ack preCommit (now knows all voted YES)`);
  }

  doCommit() {
    this.state = "committed";
    console.log(`  ${this.name}: COMMITTED`);
  }

  doAbort() {
    this.state = "aborted";
    console.log(`  ${this.name}: ABORTED`);
  }

  recoverWithPeers(peers) {
    // If coordinator is gone, check peer states to decide
    if (this.state === "pre-committed") {
      // If I'm pre-committed, everyone voted YES — safe to commit
      const anyAborted = peers.some(p => p.state === "aborted");
      if (anyAborted) { this.doAbort(); }
      else            { this.doCommit(); }
    } else if (this.state === "prepared") {
      // If I'm only prepared (not pre-committed), someone may have aborted
      const anyPreCommit = peers.some(p => p.state === "pre-committed" || p.state === "committed");
      if (anyPreCommit) { this.doCommit(); }
      else              { this.doAbort(); } // safe to abort — no one pre-committed yet
    }
  }
}

class ThreePCCoordinator {
  constructor(participants) { this.participants = participants; }

  run(txId, crashAfterPhase = null) {
    // Phase 1: CanCommit?
    console.log(`[Phase 1 — CAN COMMIT? TxId=${txId}]`);
    const votes = this.participants.map(p => p.canCommit());
    if (!votes.every(v => v)) {
      this.participants.forEach(p => p.doAbort());
      return;
    }
    if (crashAfterPhase === 1) { console.log("*** COORDINATOR CRASHES after Phase 1 ***"); return; }

    // Phase 2: PreCommit — tell everyone all voted YES
    console.log(`[Phase 2 — PRE-COMMIT]`);
    this.participants.forEach(p => p.preCommit());
    if (crashAfterPhase === 2) { console.log("*** COORDINATOR CRASHES after Phase 2 ***"); return; }

    // Phase 3: DoCommit
    console.log(`[Phase 3 — DO COMMIT]`);
    this.participants.forEach(p => p.doCommit());
  }
}

// 3PC crash after Phase 1 — participants self-resolve (safe to abort)
console.log("=== 3PC: coordinator crashes after Phase 1 ===");
const ps1 = [new ThreePCParticipant("A"), new ThreePCParticipant("B")];
new ThreePCCoordinator(ps1).run("tx-1", 1);
console.log("Peer recovery:");
ps1[0].recoverWithPeers([ps1[1]]);
ps1[1].recoverWithPeers([ps1[0]]);

// 3PC crash after Phase 2 — all pre-committed, safe to commit
console.log("\n=== 3PC: coordinator crashes after Phase 2 ===");
const ps2 = [new ThreePCParticipant("A"), new ThreePCParticipant("B")];
new ThreePCCoordinator(ps2).run("tx-2", 2);
console.log("Peer recovery:");
ps2[0].recoverWithPeers([ps2[1]]);
ps2[1].recoverWithPeers([ps2[0]]);
console.log("Both committed without coordinator — non-blocking!");
§08

Comparison and real systems

All five mechanisms solve different parts of the same problem. Production systems combine several, but not in one universal recipe. PostgreSQL is MVCC + WAL, with optional serializable conflict tracking and optional 2PC for prepared transactions. InnoDB mixes MVCC snapshots with locking. Oracle has its own multiversion design. SQL Server is lock-based by default, with row-versioned modes available. Spanner and CockroachDB layer MVCC on top of consensus.

Mechanism  | Solves                       | Read blocks write?          | Write blocks read? | Overhead                        | Used in
2PL        | Isolation                    | Yes (S-lock)                | Yes (X-lock)       | Lock mgmt + deadlock detection  | SQL Server default, classic lock managers, InnoDB locking
OCC        | Isolation                    | No                          | No                 | Abort + retry under contention  | Datomic, some key-value stores
MVCC       | Isolation                    | No                          | No                 | Version storage + GC (VACUUM)   | PostgreSQL, Oracle, CockroachDB, InnoDB snapshots
WAL        | Durability                   | N/A — not about concurrency | N/A                | Sequential log I/O              | All serious databases
2PC        | Distributed atomicity        | N/A — cross-node protocol   | N/A                | Blocking on coordinator failure | XA / prepared transactions / distributed coordinators
3PC        | Non-blocking dist. atomicity | N/A                         | N/A                | Network partition unsafe        | Rarely; superseded by Paxos/Raft
Paxos/Raft | Distributed consensus        | N/A                         | N/A                | Partition-tolerant, live        | Spanner, CockroachDB, etcd, MongoDB

How isolation protocols compare under contention

Throughput under varying contention — adjust conflict rate
A safer summary: most serious OLTP databases combine some isolation scheme + WAL durability, but the exact isolation mix differs a lot by engine. PostgreSQL, InnoDB, Oracle, and SQL Server do not all implement the same concurrency recipe. For distributed transactions, 2PC remains common when you already have a trusted coordinator, while Paxos/Raft-based systems (Spanner, CockroachDB) fold atomicity into replicated consensus.