Part Seven · Chapter 22 · 1 of 3

The road to
Transformers

Seventy years of wrong turns, abandoned ideas, and sudden clarity — the chain of intuitions that led from Shannon's coin flips to the attention mechanism. Each section shows the model that era produced, running in your browser.

Contents

§01 Shannon & the statistics of language
§02 N-gram models
§03 The Perceptron and its limits
§04 Backpropagation & the revival
§05 Recurrent networks & sequences
§06 The vanishing gradient problem
§07 LSTMs: memory with gates
§08 Word embeddings & meaning as geometry
§09 Sequence-to-sequence & the bottleneck
§10 Attention: a soft, learnable lookup
§11 Removing recurrence entirely

1948

§01

1948Shannon & the statistics of language

Claude Shannon's 1948 paper A Mathematical Theory of Communication did not set out to explain language. It set out to understand how much information a channel could carry. But in doing so it established the foundational idea that would eventually lead to LLMs: language is a probability distribution over sequences of symbols.

Shannon's key demonstration was entropy — a measure of how unpredictable the next symbol in a sequence is. He showed that English has an entropy of roughly 1 bit per character: knowing the preceding characters, the next one is not random — it is heavily constrained. The letter q is almost always followed by u. The word the is far more likely after a period than after another the.

Shannon made this vivid with a thought experiment: he asked human subjects to guess the next letter in a text, measuring how often they were right. Humans, steeped in statistical regularities of English, performed far above chance — demonstrating that language has deep probabilistic structure. A machine that could learn that structure would be a machine that could model language.

Shannon's guessing game — predict the next character:

entropy.js — measuring language entropy JS

// Shannon entropy: H = -Σ p(x) log₂ p(x)
// Measures the average surprise (bits) of the next symbol.

function entropy(freq) {
  const total = Object.values(freq).reduce((a,b)=>a+b, 0);
  return -Object.values(freq).reduce((sum, count) => {
    if (count === 0) return sum;
    const p = count / total;
    return sum + p * Math.log2(p);
  }, 0);
}

function charFreq(text) {
  const f = {};
  for (const c of text.toLowerCase()) {
    if (/[a-z ]/.test(c)) f[c] = (f[c]||0) + 1;
  }
  return f;
}

// Compare entropy across different text sources
const sources = [
  { name: 'Random letters',   text: Array.from({length:500},()=>String.fromCharCode(97+Math.floor(Math.random()*26))).join('') },
  { name: 'English prose',    text: 'the cat sat on the mat a big dog ran on the path the sun set over the hill and the wind blew cold through the trees the old man walked slowly down the long road' },
  { name: 'Repetitive text',  text: 'the the the the the cat cat cat the the mat mat mat the the the the the cat cat sat sat sat on on the the mat' },
  { name: 'Code (JS)',        text: 'function add a b return a plus b const x equals add one two console log x if x greater three then doSomething else doOther end' },
];

console.log('Shannon entropy by text type (bits per character):\n');
console.log('Source'.padEnd(22) + 'Entropy   Predictability');
console.log('─'.repeat(52));
for (const { name, text } of sources) {
  const freq = charFreq(text);
  const H = entropy(freq);
  const maxH = Math.log2(Object.keys(freq).length);
  const predictability = (1 - H/maxH) * 100;
  const bar = '█'.repeat(Math.round(predictability/5));
  console.log(`${name.padEnd(22)}${H.toFixed(3)} bits   ${predictability.toFixed(0).padStart(3)}% ${bar}`);
}

console.log('\nKey insight: lower entropy = more predictable = easier to model.');
console.log('English prose is ~40-50% predictable from character frequencies alone.');

This statistical view of language was not universally accepted. Chomsky would argue in 1959 that language has generative structure that no finite-state model could capture — a productive critique that sent linguistics in a different direction for decades. But the probabilistic view quietly persisted in engineering. Every spell-checker, speech recognizer, and machine translation system built between 1970 and 2010 was built on Shannon's foundation.

1950s–

§02

1950sN-gram models first working LM

Shannon's theory needed an implementation. The simplest one was the n-gram model: estimate P(word | previous n−1 words) by counting how often that sequence appeared in a large corpus, then normalize.

A unigram model knows only word frequencies. A bigram model tracks pairs. A trigram model tracks triples. Each step buys more context at the cost of more data — a trigram model on a 100,000-word vocabulary requires tracking 10¹⁵ possible triples, most of which will never appear. Sparse data is the central problem with n-grams.

Practitioners developed elaborate smoothing techniques — Kneser-Ney, Good-Turing, Katz backoff — to handle the inevitable zero-count problem: what probability should be assigned to sequences never seen in training? These techniques worked well enough that n-gram language models ran at the heart of commercial speech recognition and machine translation systems from the 1980s through the early 2010s.

But they have a hard ceiling. A trigram model cannot learn that "the dog chased the cat" and "the cat was pursued by the dog" are semantically equivalent — they share no n-gram overlap. Context is bounded at three words regardless of the sentence. No amount of smoothing fixes this architectural limit.

ngram.js — bigram and trigram language models JS

// N-gram language model with Laplace (add-1) smoothing.
// Simple but fully functional — this is the pre-neural baseline.

function buildNgram(tokens, n) {
  const counts = {};    // context → { next_word: count }
  const ctxCounts = {}; // context → total count

  for (let i = 0; i <= tokens.length - n; i++) {
    const ctx  = tokens.slice(i, i+n-1).join(' ');
    const next = tokens[i+n-1];
    if (!counts[ctx]) counts[ctx] = {};
    counts[ctx][next] = (counts[ctx][next] || 0) + 1;
    ctxCounts[ctx] = (ctxCounts[ctx] || 0) + 1;
  }
  return { counts, ctxCounts };
}

function ngramProb(model, vocabSize, ctx, word, alpha=0.1) {
  // Laplace smoothing: add alpha to every count
  const ctxCounts = model.ctxCounts[ctx] || 0;
  const wordCount = (model.counts[ctx] || {})[word] || 0;
  return (wordCount + alpha) / (ctxCounts + alpha * vocabSize);
}

function ngramGenerate(model, vocabSize, seed, vocab, n, steps=10) {
  let tokens = [...seed];
  for (let i = 0; i < steps; i++) {
    const ctx = tokens.slice(-(n-1)).join(' ');
    // Score all vocabulary words and sample
    const scores = vocab.map(w => ({w, p: ngramProb(model, vocabSize, ctx, w)}));
    const total  = scores.reduce((s,x)=>s+x.p, 0);
    let r = Math.random() * total, chosen = vocab[0];
    for (const {w, p} of scores) { r -= p; if (r<=0) { chosen=w; break; } }
    tokens.push(chosen);
  }
  return tokens;
}

// Training corpus
const corpus = ('the cat sat on the mat the big cat ran on the long mat ' +
                'a small dog sat on a mat the cat and the dog sat together ' +
                'the dog ran fast the cat ran faster the mat was on the floor').split(' ');
const vocab  = [...new Set(corpus)];
const V      = vocab.length;

// Train bigram and trigram models
const bigram  = buildNgram(corpus, 2);
const trigram = buildNgram(corpus, 3);

console.log(`Corpus: ${corpus.length} tokens, vocabulary: ${V} words\n`);

// Show top predictions after "the cat"
const ctx = 'the cat';
console.log(`Top 5 trigram predictions after "the cat":`);
const preds = vocab.map(w => ({ w, p: ngramProb(trigram, V, ctx, w) }))
  .sort((a,b)=>b.p-a.p).slice(0,5);
preds.forEach(({w,p}) => {
  const bar = '█'.repeat(Math.round(p*200));
  console.log(`  "${w}".padEnd(10) ${(p*100).toFixed(1).padStart(5)}%  ${bar}`);
});

// Generate text from each model
console.log('\nGenerated text (bigram):');
console.log(' ', ngramGenerate(bigram,  V, ['the','cat'], vocab, 2, 12).join(' '));
console.log('Generated text (trigram):');
console.log(' ', ngramGenerate(trigram, V, ['the','cat'], vocab, 3, 12).join(' '));

console.log('\nThe hard limit: context window is fixed at n-1 words.');
console.log(`Trigram can only look back ${3-1} words — no long-range dependencies.`);

1958

§03

1958The Perceptron and its limits

In 1958, Frank Rosenblatt built the Perceptron — a physical machine with 400 photocells that could learn to classify shapes. The New York Times reported it might "be able to walk, talk, see, write, reproduce itself and be conscious of its existence." The dream of a learning machine was suddenly vivid.

The Perceptron learns a linear classifier. Given inputs x₁...xₙ and weights w₁...wₙ, it computes Σ wᵢxᵢ, compares to a threshold, and outputs +1 or −1. When it gets a prediction wrong, it adjusts the weights by a small amount in the direction that would have been correct. This perceptron learning rule is provably convergent when the data is linearly separable.

The fatal critique: In 1969, Marvin Minsky and Seymour Papert published Perceptrons, proving that a single-layer perceptron cannot learn XOR — a function that requires a nonlinear boundary. Since many interesting problems are nonlinear, they argued the architecture was fundamentally limited. Funding evaporated. The first AI winter began.

In retrospect, the critique was correct but the conclusion too strong. The limitation was not perceptrons as such but single-layer perceptrons. Stacking layers creates nonlinear functions — but in 1969 nobody knew how to train them.

perceptron.js — learning rule and XOR failure JS

// The Perceptron learning algorithm (Rosenblatt, 1958).
// Demonstrates both its success on linearly separable data
// and its fundamental failure on XOR.

function perceptron(inputs, weights, bias) {
  const sum = inputs.reduce((s,x,i) => s + x*weights[i], bias);
  return sum >= 0 ? 1 : -1;
}

function trainPerceptron(data, lr=0.1, maxEpochs=100) {
  // Weights and bias start at zero; the algorithm discovers their values purely
  // from mistakes — no prior knowledge of the problem is baked in.
  let weights = [0, 0];
  let bias    = 0;

  // One epoch = one full pass through every training example.
  for (let epoch = 0; epoch < maxEpochs; epoch++) {
    let errors = 0;
    for (const [x, y] of data) {
      const pred = perceptron(x, weights, bias);
      if (pred !== y) {
        // We were wrong. (y - pred) is ±2:
        //   predicted -1, truth +1  →  y-pred = +2  →  weights grow toward x
        //   predicted +1, truth -1  →  y-pred = -2  →  weights shrink away from x
        // lr scales how large a step we take; a full ±2 step would overshoot.
        weights = weights.map((w,i) => w + lr * (y-pred) * x[i]);
        bias   += lr * (y - pred);
        errors++;
      }
    }
    // A full epoch with zero mistakes means the weights now separate every
    // example correctly — the perceptron convergence theorem guarantees this
    // happens in finite epochs *if* the data is linearly separable.
    if (errors === 0) return { weights, bias, epochs: epoch+1, converged: true };
  }
  return { weights, bias, epochs: maxEpochs, converged: false };
}

// ── AND: linearly separable — should converge ─────────────────────
const AND = [
  [[0,0], -1], [[0,1], -1], [[1,0], -1], [[1,1],  1],
];
const andResult = trainPerceptron(AND);
console.log('AND gate (linearly separable):');
console.log(`  Converged: ${andResult.converged} in ${andResult.epochs} epochs`);
AND.forEach(([x,y]) => {
  const pred = perceptron(x, andResult.weights, andResult.bias);
  console.log(`  input ${x} → predicted ${pred===1?'+1':'-1'}, actual ${y===1?'+1':'-1'} ${pred===y?'✓':'✗'}`);
});

// ── XOR: not linearly separable — will never converge ─────────────
const XOR = [
  [[0,0], -1], [[0,1],  1], [[1,0],  1], [[1,1], -1],
];
const xorResult = trainPerceptron(XOR, 0.1, 200);
console.log('\nXOR gate (NOT linearly separable):');
console.log(`  Converged: ${xorResult.converged} after ${xorResult.epochs} epochs`);
XOR.forEach(([x,y]) => {
  const pred = perceptron(x, xorResult.weights, xorResult.bias);
  console.log(`  input ${x} → predicted ${pred===1?'+1':'-1'}, actual ${y===1?'+1':'-1'} ${pred===y?'✓':'✗'}`);
});
console.log('\nThe perceptron cannot learn XOR — it requires a nonlinear decision boundary.');
console.log('Minsky & Papert, 1969: this single observation froze AI funding for 15 years.');

Why AND succeeds and XOR fails — plotting the four inputs

1986

§04

1986Backpropagation & the revival key unlock

The answer to XOR had been known since the 1960s: add a hidden layer. A layer is a set of nodes that each compute a weighted sum of their inputs and pass the result through a nonlinear function — the sigmoid, for instance, squashes any number into the range (0, 1). A single-layer perceptron can only draw one straight boundary; add a second layer and each node in it combines several of those boundaries, producing curves and kinks. A two-layer network can therefore learn any nonlinear boundary in principle. The problem was training it — how do you adjust the weights of a hidden layer that has no direct output to measure against?

The chain rule of calculus answers this: differentiate the loss with respect to each weight, following the computation graph backwards from output to input. This had been discovered and forgotten several times before Rumelhart, Hinton, and Williams published their landmark 1986 paper Learning representations by back-propagating errors — the version that finally stuck, because it came with compelling experiments on language, perception, and logic tasks.

Backpropagation made deep learning possible in principle. But the practice remained hard for another 25 years: computers were too slow, datasets too small, and the training of very deep networks still unstable. The 1986 revival produced interesting results but no commercial breakthroughs, and a second, milder winter followed in the 1990s.

mlp_xor.js — two-layer network solves XOR JS

// A two-layer network trained with backpropagation solves XOR.
// This is exactly what the 1986 Rumelhart paper demonstrated.

function sigmoid(x)     { return 1/(1+Math.exp(-x)); }
function sigmoidGrad(x) { const s=sigmoid(x); return s*(1-s); }

// Network: 2 inputs → 2 hidden → 1 output
// Initialize weights with small random values
function randn() { return (Math.random()-0.5)*0.5; }
let W1 = [[randn(),randn()],[randn(),randn()]]; // [2×2]
let b1 = [randn(), randn()];
let W2 = [[randn(),randn()]];                    // [1×2]
let b2 = [randn()];

function forward(x) {
  // Hidden layer
  const z1 = W1.map((row,i) => row.reduce((s,w,j)=>s+w*x[j], b1[i]));
  const a1  = z1.map(sigmoid);
  // Output layer
  const z2  = [W2[0].reduce((s,w,j)=>s+w*a1[j], b2[0])];
  const a2  = z2.map(sigmoid);
  return { z1, a1, z2, a2 };
}

function trainStep(data, lr=0.5) {
  let totalLoss = 0;
  // Zero gradients
  const dW1=[[0,0],[0,0]], db1=[0,0], dW2=[[0,0]], db2=[0];

  for (const [x, yTarget] of data) {
    const { z1, a1, z2, a2 } = forward(x);
    const loss = 0.5*(a2[0]-yTarget)**2;
    totalLoss += loss;

    // Output delta
    const delta2 = [(a2[0]-yTarget) * sigmoidGrad(z2[0])];
    // Hidden delta
    const delta1 = z1.map((_,i) => W2[0][i]*delta2[0] * sigmoidGrad(z1[i]));

    // Accumulate gradients
    W2[0].forEach((_,j) => { dW2[0][j] += delta2[0]*a1[j]; });
    db2[0] += delta2[0];
    W1.forEach((row,i) => row.forEach((_,j) => { dW1[i][j] += delta1[i]*x[j]; }));
    delta1.forEach((d,i) => { db1[i] += d; });
  }

  // Update weights
  W1.forEach((row,i) => row.forEach((_,j) => { W1[i][j]-=lr*dW1[i][j]; }));
  b1.forEach((_,i) => { b1[i]-=lr*db1[i]; });
  W2.forEach((row,i) => row.forEach((_,j) => { W2[i][j]-=lr*dW2[i][j]; }));
  b2.forEach((_,i) => { b2[i]-=lr*db2[i]; });

  return totalLoss/data.length;
}

const XOR = [[[0,0],0],[[0,1],1],[[1,0],1],[[1,1],0]];

let loss;
for (let ep=0; ep<5000; ep++) loss = trainStep(XOR);

console.log('Two-layer network trained on XOR with backpropagation:\n');
XOR.forEach(([x,y]) => {
  const pred = forward(x).a2[0];
  const label = pred > 0.5 ? 1 : 0;
  console.log(`  input [${x}] → output ${pred.toFixed(4)}  → class ${label}  target ${y} ${label===y?'✓':'✗'}`);
});
console.log(`\nFinal loss: ${loss.toFixed(6)}`);
console.log('The addition of one hidden layer + backprop solves what stumped Rosenblatt.');

1990s

§05

1990sRecurrent networks & sequences

Feedforward networks process one fixed-size input at a time. Language is sequential — a sentence is a series of words of variable length, and the meaning of word 10 can depend on word 1. To handle sequences, researchers turned to recurrent neural networks.

An RNN processes tokens one at a time, maintaining a hidden state — a vector that summarizes everything seen so far. At each step, the new hidden state is a function of the old one and the new input: hₜ = tanh(Wₕ · hₜ₋₁ + Wₓ · xₜ + b). The hidden state is the memory of the network.

This was elegant and it worked — RNNs could model sequences of arbitrary length with a fixed number of parameters. They powered early speech recognition, handwriting recognition, and text generation systems throughout the 1990s and early 2000s.

rnn.js — vanilla RNN forward pass JS

// Vanilla RNN — processes a sequence step by step,
// maintaining a hidden state vector.

function tanh(x)     { return Math.tanh(x); }
function tanhGrad(x) { return 1 - Math.tanh(x)**2; }

function matVecMul(M, v) {
  return M.map(row => row.reduce((s,w,j)=>s+w*v[j], 0));
}
function vecAdd(a, b) { return a.map((v,i)=>v+b[i]); }
function vecScale(v, s) { return v.map(x=>x*s); }

class VanillaRNN {
  constructor(inputSize, hiddenSize, outputSize) {
    const r = () => (Math.random()-0.5)*0.2;
    this.Wh = Array.from({length:hiddenSize},()=>Array.from({length:hiddenSize},r));
    this.Wx = Array.from({length:hiddenSize},()=>Array.from({length:inputSize },r));
    this.Wy = Array.from({length:outputSize},()=>Array.from({length:hiddenSize},r));
    this.bh = new Array(hiddenSize).fill(0);
    this.by = new Array(outputSize).fill(0);
    this.h  = new Array(hiddenSize).fill(0); // hidden state
  }

  step(x) {
    // hₜ = tanh(Wₕ·hₜ₋₁ + Wₓ·xₜ + bₕ)
    const fromH = matVecMul(this.Wh, this.h);
    const fromX = matVecMul(this.Wx, x);
    const preact = vecAdd(vecAdd(fromH, fromX), this.bh);
    this.h = preact.map(tanh);

    // yₜ = Wᵧ·hₜ + bᵧ (logits, no softmax here)
    return vecAdd(matVecMul(this.Wy, this.h), this.by);
  }

  resetState() { this.h.fill(0); }
}

// ── Demo: process "the cat sat" token by token ───────────────────
const VOCAB = ['<pad>','the','cat','sat','on','mat','dog','ran'];
const V = VOCAB.length;

// One-hot encode tokens
function oneHot(id, size) {
  const v = new Array(size).fill(0);
  v[id] = 1;
  return v;
}

const rnn = new VanillaRNN(V, 4, V);  // 4-dim hidden state
const sentence = ['the','cat','sat'];
const ids = sentence.map(w => VOCAB.indexOf(w));

console.log('RNN processes "the cat sat" token by token:\n');
console.log('Token      Hidden state (4 dims)                 Top prediction');
console.log('─'.repeat(70));

ids.forEach((id, t) => {
  const x = oneHot(id, V);
  const logits = rnn.step(x);

  // Softmax
  const max = Math.max(...logits);
  const exps = logits.map(x=>Math.exp(x-max));
  const sum  = exps.reduce((a,b)=>a+b,0);
  const probs = exps.map(p=>p/sum);
  const topIdx = probs.indexOf(Math.max(...probs));

  const hStr = rnn.h.map(v=>v.toFixed(2).padStart(6)).join(' ');
  console.log(`"${sentence[t]}".padEnd(10) [${hStr} ]  → "${VOCAB[topIdx]}"`);
});

console.log('\nThe hidden state evolves as each token is processed.');
console.log('It must compress the entire history into a fixed-size vector.');

1991

§06

1991The vanishing gradient problem

The RNN has a critical flaw. Hochreiter documented it carefully in his 1991 diploma thesis: when you backpropagate through many time steps, the gradient signal multiplies the same weight matrix at every step. If the spectral radius of that matrix (the magnitude of its largest eigenvalue — roughly, how much the matrix stretches vectors on average) is less than 1, the gradient decays exponentially — it vanishes before reaching early time steps. If greater than 1, it explodes.

This means an RNN trained with backpropagation through time cannot reliably learn long-range dependencies. "The trophy didn't fit in the suitcase because it was too big" — the antecedent of "it" is 8 words back. An RNN trained on this sentence will fail to connect it to trophy because the gradient signal linking them has shrunk to near zero by the time it reaches that far.

Gradient magnitude through RNN time steps:

vanishing.js — gradient flow through time JS

// Vanishing gradient in RNNs.
// During backprop through time, gradients are multiplied by
// dh_t/dh_{t-1} at each step. If this value < 1, signal vanishes.

// Simplified: track gradient magnitude through T steps
// dL/dh_0 = dL/dh_T * prod_{t=1}^{T} (dh_t/dh_{t-1})
// Each dh_t/dh_{t-1} ≈ tanh'(z_t) * W_h (simplified to scalar)

function simulateGradientFlow(T, W_h_norm, noise=0.02) {
  const grads = [1.0]; // gradient starts at 1 at the output
  let g = 1.0;
  for (let t = 0; t < T; t++) {
    // At each step, multiply by tanh'(z) * W_h
    // tanh'(z) is between 0 and 1 (max at z=0 is 1)
    // For a saturated unit tanh'(z) ≈ 0.3
    const tanhPrime = 0.4 + noise*(Math.random()-0.5); // ~saturated
    g *= tanhPrime * W_h_norm;
    grads.push(g);
  }
  return grads;
}

const T = 20; // sequence length

const configs = [
  { name: 'Stable   (‖Wh‖ = 1.0)', norm: 1.0  },
  { name: 'Vanishing(‖Wh‖ = 0.7)', norm: 0.7  },
  { name: 'Exploding(‖Wh‖ = 1.6)', norm: 1.6  },
];

for (const { name, norm } of configs) {
  const grads = simulateGradientFlow(T, norm, 0);
  console.log(`\n${name}`);
  console.log('Step:  ' + Array.from({length:T+1},(_,i)=>String(i).padStart(6)).join(''));
  console.log('|grad|:' + grads.map(g=>{
    if (g > 999) return '  >999';
    if (g < 0.001) return '  ≈0  ';
    return g.toFixed(3).padStart(6);
  }).join(''));
}

console.log('\nVanishing: gradient at step 0 is ~0 → early tokens get no training signal.');
console.log('Exploding: gradient at step 0 is huge → training diverges.');
console.log('\nThis is why vanilla RNNs cannot learn long-range dependencies.');

The vanishing gradient problem was identified in 1991 but widely discussed only after Bengio, Simard, and Frasconi's 1994 paper. It sat as an open problem for three years until Hochreiter and Schmidhuber produced their solution.

1997

§07

1997LSTMs: memory with gates solves vanishing gradient

Hochreiter and Schmidhuber's Long Short-Term Memory (1997) solved the vanishing gradient problem with a single elegant idea: a cell state that runs straight through time with only additive modifications, and gates that control what gets written to it, read from it, and erased.

The cell state is a conveyor belt. During backpropagation, there exists a gradient path through the cell state where updates are purely additive — no tanh derivatives are multiplied in, no sigmoid squashing. Unlike the hidden-state pathway, which accumulates products of values smaller than 1 at every step, the cell state offers a route that carries gradients backward without systematically shrinking them. The forget gate still multiplies (gradients can still be attenuated there), but the cell state gives the optimizer a highway around the vanishing-gradient bottleneck. The three gates — forget, input, output — are learned sigmoid functions that control what information survives. The forget gate decides what to erase. The input gate decides what new information to write. The output gate decides what to expose as the hidden state.

LSTM cell — step through gates:

lstm.js — LSTM cell from scratch JS

// LSTM cell — the four-gate architecture of Hochreiter & Schmidhuber (1997).
// Three innovations over vanilla RNN:
//   1. Cell state (c) — gradient highway with additive updates
//   2. Forget gate — learned selective erasure
//   3. Input/output gates — controlled read/write

function sigmoid(x)  { return 1/(1+Math.exp(-x)); }
function tanh(x)     { return Math.tanh(x); }

class LSTMCell {
  constructor(inputSize, hiddenSize) {
    const scale = 0.1;
    const r = () => (Math.random()-0.5)*scale;
    const mat = (rows,cols) => Array.from({length:rows},()=>Array.from({length:cols},r));
    const vec = n => new Array(n).fill(0);

    const D = hiddenSize, I = inputSize;
    // Weight matrices for all four gates: [forget, input, cell_gate, output]
    this.Wf=mat(D,D+I); this.bf=vec(D);  // forget gate
    this.Wi=mat(D,D+I); this.bi=vec(D);  // input gate
    this.Wg=mat(D,D+I); this.bg=vec(D);  // cell gate (candidate)
    this.Wo=mat(D,D+I); this.bo=vec(D);  // output gate

    // Initial states
    this.h = vec(D);  // hidden state
    this.c = vec(D);  // cell state
    this.D = D;
  }

  step(x) {
    // Concatenate [h, x]
    const hx = [...this.h, ...x];

    // Gate computations
    const f  = this.Wf.map((row,i) => sigmoid(row.reduce((s,w,j)=>s+w*hx[j], this.bf[i])));
    const i_ = this.Wi.map((row,i) => sigmoid(row.reduce((s,w,j)=>s+w*hx[j], this.bi[i])));
    const g  = this.Wg.map((row,i) => tanh   (row.reduce((s,w,j)=>s+w*hx[j], this.bg[i])));
    const o  = this.Wo.map((row,i) => sigmoid(row.reduce((s,w,j)=>s+w*hx[j], this.bo[i])));

    // Cell state update: additive! — the gradient highway
    // c_t = f * c_{t-1}  +  i * g
    //        ↑ forget         ↑ write new info
    this.c = this.c.map((_,k) => f[k]*this.c[k] + i_[k]*g[k]);

    // Hidden state
    this.h = this.c.map((v,k) => o[k]*tanh(v));

    return { f, i: i_, g, o, c: [...this.c], h: [...this.h] };
  }
}

// ── Demo: process a sentence, watch cell state evolve ─────────────
const VOCAB = ['<s>','the','cat','sat','on','mat','dog','ran','<e>'];
const V = VOCAB.length;
function oneHot(id) { const v=new Array(V).fill(0); v[id]=1; return v; }

const lstm = new LSTMCell(V, 3);  // 3-dim cell state
const sentence = ['the','cat','sat','on','the','mat'];
const ids = sentence.map(w=>VOCAB.indexOf(w));

console.log('LSTM processing "the cat sat on the mat"\n');
console.log('Token     Forget gate    Input gate     Cell state');
console.log('─'.repeat(60));

ids.forEach((id, t) => {
  const { f, i, c } = lstm.step(oneHot(id));
  const fStr = f.map(v=>v.toFixed(2)).join(' ');
  const iStr = i.map(v=>v.toFixed(2)).join(' ');
  const cStr = c.map(v=>v.toFixed(2)).join(' ');
  console.log(`"${sentence[t]}".padEnd(10) [${fStr}]  [${iStr}]  [${cStr}]`);
});

console.log('\nForget gate ≈ 1.0 → keep previous cell state');
console.log('Forget gate ≈ 0.0 → erase it');
console.log('Input gate  ≈ 1.0 → write new information');
console.log('\nThe cell state provides a gradient highway: a path through the network');
console.log('where updates are additive, not gated through tanh — so gradients');
console.log('reach early time steps without the systematic decay of plain RNNs.');

LSTMs remained the dominant sequence model for twenty years. They powered Google's neural machine translation (2016), Apple's Siri, and Amazon's Alexa. The GRU (Gated Recurrent Unit, 2014) simplified the architecture to two gates while achieving comparable results. Both are still used today in constrained environments where Transformer compute costs are prohibitive.

2013

§08

2013Word embeddings & meaning as geometry semantic geometry

In every neural network language model so far, words were one-hot vectors — orthogonal, equally distant, carrying no information about similarity. "Dog" and "puppy" were as different as "dog" and "democracy" in the input space. A critical insight was waiting to be harvested: words that appear in similar contexts must have similar meanings.

Mikolov et al.'s word2vec (2013) made this operational at scale. Train a shallow network to predict a word from its context (CBOW) or to predict context from a word (skip-gram). Discard the network after training. Keep the embedding matrix. The embeddings that emerge have striking geometric properties — semantic relationships are encoded as vector arithmetic:

king − man + woman ≈ queen

Paris − France + Germany ≈ Berlin

walked − walk + swim ≈ swam

This was not programmed in. It emerged from the statistical structure of text. The mechanism: if "man" and "woman" appear in similar contexts as each other, and "king" appears in similar contexts to "man" (royal, political, leader), then "queen" must be near the result of the arithmetic. The geometry of the embedding space mirrors the geometry of meaning.

Word2vec also introduced the idea of negative sampling: instead of computing a softmax over the entire vocabulary (expensive), train a binary classifier to distinguish real context words from random noise words. This made training on billions of words tractable on a single machine.

word2vec.js — skip-gram with negative sampling JS

// Skip-gram word2vec with negative sampling — from scratch.
// Train a shallow network to predict context from center word.
// The embeddings are the by-product we care about.

function sigmoid(x) { return 1/(1+Math.exp(-x)); }

const VOCAB = ['king','queen','man','woman','royal','crown','prince','princess',
               'dog','cat','pet','animal','run','walk','fast','slow'];
const V = VOCAB.length;
const D = 8;  // embedding dimension

// Initialize embeddings (center and context matrices)
const rand = () => (Math.random()-0.5)*0.1;
let W = Array.from({length:V},()=>Array.from({length:D},rand)); // center embeddings
let C = Array.from({length:V},()=>Array.from({length:D},rand)); // context embeddings

// Skip-gram training pairs (manually constructed for demo)
const pairs = [
  // royalty cluster
  [0,1],[0,4],[0,5],[0,6],  // king→queen,royal,crown,prince
  [1,0],[1,4],[1,5],[1,7],  // queen→king,royal,crown,princess
  [2,0],[2,6],[2,8],        // man→king,prince,dog
  [3,1],[3,7],[3,9],        // woman→queen,princess,cat
  // animal cluster
  [8,9],[8,10],[8,11],[8,12],[8,13],   // dog-cat-pet-animal-run-walk
  [9,8],[9,10],[9,11],[9,14],[9,15],   // cat-dog-pet-animal-fast-slow
];

function dot(a,b) { return a.reduce((s,v,i)=>s+v*b[i], 0); }
function cosineSim(a,b) {
  const d=dot(a,b), na=Math.sqrt(dot(a,a)), nb=Math.sqrt(dot(b,b));
  return d/(na*nb+1e-10);
}

// Negative sampling: one positive pair + k negative samples
function trainStep(centerIdx, contextIdx, k=5, lr=0.05) {
  const posScore = sigmoid(dot(W[centerIdx], C[contextIdx]));
  const posLoss  = -Math.log(posScore+1e-10);

  // Positive gradient
  const dCenter = C[contextIdx].map(v => (posScore-1)*v);
  const dCtx    = W[centerIdx].map(v => (posScore-1)*v);

  // Negative samples
  for (let n=0; n<k; n++) {
    let negIdx = Math.floor(Math.random()*V);
    while (negIdx===contextIdx) negIdx=Math.floor(Math.random()*V);
    const negScore = sigmoid(dot(W[centerIdx], C[negIdx]));
    dCenter.forEach((_,d) => { dCenter[d] += negScore*C[negIdx][d]; });
    C[negIdx].forEach((_,d) => { C[negIdx][d] -= lr*negScore*W[centerIdx][d]; });
  }

  W[centerIdx].forEach((_,d) => { W[centerIdx][d] -= lr*dCenter[d]; });
  C[contextIdx].forEach((_,d) => { C[contextIdx][d] -= lr*dCtx[d]; });
}

// Train
for (let epoch=0; epoch<800; epoch++)
  for (const [c,ctx] of pairs) trainStep(c, ctx);

// Evaluate geometric properties
console.log('Cosine similarity after training:\n');
const check = [
  ['king','queen'], ['king','man'], ['man','woman'],
  ['dog','cat'],    ['dog','animal'], ['king','dog'],
];
check.forEach(([a,b]) => {
  const sim = cosineSim(W[VOCAB.indexOf(a)], W[VOCAB.indexOf(b)]);
  const bar = '█'.repeat(Math.max(0,Math.round((sim+1)/2*20)));
  console.log(`  sim("${a}","${b}") = ${sim.toFixed(3).padStart(7)}  ${bar}`);
});

// Analogy: king - man + woman ≈ ?
const king=W[0], man=W[2], woman=W[3];
const target = king.map((_,i)=>king[i]-man[i]+woman[i]);
const scores = VOCAB.map((w,i)=>({w, s:cosineSim(target,W[i])}))
  .filter(x=>!['king','man','woman'].includes(x.w))
  .sort((a,b)=>b.s-a.s);
console.log(`\nAnalogy: king − man + woman ≈ "${scores[0].w}" (sim=${scores[0].s.toFixed(3)})`);
console.log('(Expected: "queen")');

2014

§09

2014Sequence-to-sequence & the bottleneck

Sutskever, Vinyals, and Le (2014) combined LSTMs and word embeddings into a general architecture for sequence-to-sequence tasks: an encoder LSTM reads the source sequence into a fixed-size vector, and a decoder LSTM generates the target sequence from that vector, one token at a time.

Applied to machine translation, this was state of the art. It was also the architecture that powered early neural machine translation at Google. But it had a structural flaw that became increasingly apparent at longer sequences: the entire source sentence was compressed into a single fixed-size vector before decoding began. Information was lost. Sentences of 30+ words degraded noticeably.

The bottleneck — longer sentences degrade:

seq2seq.js — encoder/decoder with visible bottleneck JS

// Seq2seq: encode sequence → fixed vector → decode.
// We measure information loss as input length grows.

function tanh(x) { return Math.tanh(x); }

// Minimal RNN encoder: returns final hidden state
function encode(tokens, Wh, Wx, b, hidSize) {
  let h = new Array(hidSize).fill(0);
  for (const x of tokens) {
    // h_t = tanh(Wh·h + Wx·x + b)
    const newH = Array.from({length:hidSize}, (_,i) =>
      tanh(Wh[i].reduce((s,w,j)=>s+w*h[j],0) +
           Wx[i].reduce((s,w,j)=>s+w*x[j],0) + b[i])
    );
    h = newH;
  }
  return h;
}

// Measure how much information from position t survives in final state
// Proxy: correlation between early token representation and final hidden state
function infoSurvival(seqLen, hidSize=4) {
  const r = () => (Math.random()-0.5)*0.2;
  const Wh = Array.from({length:hidSize},()=>Array.from({length:hidSize},r));
  const Wx = Array.from({length:hidSize},()=>Array.from({length:hidSize},r));
  const b  = new Array(hidSize).fill(0);

  // Generate a random sequence of one-hot vectors
  const seq = Array.from({length:seqLen},()=>{
    const v=new Array(hidSize).fill(0); v[Math.floor(Math.random()*hidSize)]=1; return v;
  });

  // Get final encoding
  const hFinal = encode(seq, Wh, Wx, b, hidSize);

  // For each position, get the hidden state at that point
  const survivals = [];
  let h = new Array(hidSize).fill(0);
  for (let t=0; t<seqLen; t++) {
    h = Array.from({length:hidSize},(_,i)=>
      tanh(Wh[i].reduce((s,w,j)=>s+w*h[j],0)+Wx[i].reduce((s,w,j)=>s+w*seq[t][j],0)+b[i])
    );
    // Cosine similarity between intermediate state and final state
    const dot=h.reduce((s,v,i)=>s+v*hFinal[i],0);
    const na=Math.sqrt(h.reduce((s,v)=>s+v*v,0));
    const nb=Math.sqrt(hFinal.reduce((s,v)=>s+v*v,0));
    survivals.push(Math.abs(dot/(na*nb+1e-10)));
  }
  return survivals;
}

console.log('Information survival from each position to final hidden state:\n');
console.log('Position:  ' + Array.from({length:12},(_,i)=>String(i+1).padStart(5)).join(''));
[4,8,12].forEach(len => {
  const surv = infoSurvival(len);
  // Show survival for all positions
  const vals = surv.map(v=>v.toFixed(2).padStart(5)).join('');
  const pad  = '     '.repeat(12-len);
  console.log(`SeqLen=${len}: ` + vals + pad);
});

console.log('\nThe further a token is from the end, the less it survives.');
console.log('At position 1 in a 12-token sequence, signal is severely degraded.');
console.log('\nThis is the fundamental bottleneck of seq2seq: one vector, all history.');

2015

§10

2015Attention: a soft, learnable lookup the real breakthrough

Bahdanau, Cho, and Bengio (2015) made a simple but profound observation: instead of forcing the decoder to work from one fixed vector, let it look back at all of the encoder's hidden states, weighted by relevance. The weights are computed by a small learned function that scores the compatibility between the current decoder state and each encoder state.

This is attention — a soft, differentiable lookup. At each decoding step, the model computes a weighted combination of all encoder states. The weights are normalized by softmax, so they form a probability distribution over the source sequence: "how much should I attend to position 1? position 2? ..." The whole thing is differentiable end-to-end, so it can be trained with backpropagation.

The impact was immediate and dramatic. Long sentences no longer degraded. Translating a 30-word sentence, the model could attend directly to the relevant source words when generating each target word — not through a bottleneck but through a direct weighted query. Researchers visualized the attention weights as heatmaps and found they corresponded to linguistic alignment: French "la" attends to English "the", verbs attend to their subjects.

attention_bahdanau.js — original additive attention JS

// Bahdanau attention (2015) — the original formulation.
// score(s, h_j) = v^T · tanh(W_s·s + W_h·h_j)
// where s is decoder state and h_j is encoder state at position j.
// context = Σ_j softmax(score_j) · h_j

function tanh(x) { return Math.tanh(x); }
function sigmoid(x) { return 1/(1+Math.exp(-x)); }

function softmax(arr) {
  const max = Math.max(...arr);
  const ex  = arr.map(x=>Math.exp(x-max));
  const sum = ex.reduce((a,b)=>a+b,0);
  return ex.map(x=>x/sum);
}

// Additive attention score
function attentionScore(s, h_j, Ws, Wh, v) {
  const D = s.length;
  // e_j = vᵀ · tanh(Ws·s + Wh·h_j)
  const combined = Array.from({length:D},(_,i)=>
    Ws[i].reduce((sum,w,k)=>sum+w*s[k],0) +
    Wh[i].reduce((sum,w,k)=>sum+w*h_j[k],0)
  );
  const activated = combined.map(tanh);
  return v.reduce((sum,vi,i)=>sum+vi*activated[i],0);
}

function attentionContext(s, encoderStates, Ws, Wh, v) {
  const scores  = encoderStates.map(h_j => attentionScore(s, h_j, Ws, Wh, v));
  const weights = softmax(scores);
  // context = weighted sum of encoder states
  const D = encoderStates[0].length;
  const context = Array.from({length:D},(_,i)=>
    weights.reduce((sum,w,j)=>sum+w*encoderStates[j][i],0)
  );
  return { context, weights };
}

// ── Demo: translating "the cat sat" ──────────────────────────────
// Simulated encoder hidden states for "the | cat | sat"
const encoderStates = [
  [ 0.7, -0.3, 0.1, 0.5],  // "the"
  [-0.2,  0.8, 0.6,-0.1],  // "cat"
  [-0.5,  0.1,-0.3, 0.9],  // "sat"
];
const srcWords = ['the', 'cat', 'sat'];
const tgtWords = ['le', 'chat', 'a assis'];

// Random attention parameters
const D = 4;
const Ws = Array.from({length:D},()=>Array.from({length:D},()=>(Math.random()-.5)*.3));
const Wh = Array.from({length:D},()=>Array.from({length:D},()=>(Math.random()-.5)*.3));
const v  = Array.from({length:D},()=>(Math.random()-.5)*.3);

// Simulate three decoder steps, each with a different query state
const decoderStates = [
  [ 0.1, 0.2, 0.3,-0.1],  // when generating "le"
  [-0.1, 0.7, 0.4, 0.2],  // when generating "chat"
  [-0.4, 0.2,-0.2, 0.8],  // when generating "a assis"
];

console.log('Attention weights at each decoder step:\n');
console.log('Generating  | ' + srcWords.map(w=>w.padStart(8)).join(' '));
console.log('─'.repeat(44));
decoderStates.forEach((s, t) => {
  const { weights } = attentionContext(s, encoderStates, Ws, Wh, v);
  const bars = weights.map(w=>'█'.repeat(Math.round(w*12)).padEnd(12));
  const pcts = weights.map(w=>`${(w*100).toFixed(0)}%`.padStart(8));
  console.log(`"${tgtWords[t]}".padEnd(11) |${pcts.join(' ')}`);
  console.log(`            |${bars.map(b=>b.padStart(9)).join(' ')}`);
});

console.log('\nEach row shows which source tokens the decoder attends to.');
console.log('The weights are learned — no manual alignment is specified.');

Bahdanau attention solved the seq2seq bottleneck — but it was still attached to an RNN. The encoder still processed tokens sequentially; the decoder still stepped through one token at a time. There was still an O(n) sequential dependency. Parallelism was limited. Then two years later, someone asked a harder question.

2017

§11

2017Removing recurrence entirely attention is all you need

If attention can connect any two positions in a sequence directly — bypassing the bottleneck — why not use attention to build the entire representation, without any recurrence at all?

This is the question Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin asked in their 2017 paper Attention Is All You Need. Their answer — the Transformer — replaced every recurrent and convolutional component with attention and feedforward layers. Every token attends to every other token at every layer, in parallel.

The consequences were enormous:

Parallelism: all positions are computed simultaneously — GPU utilization goes from 30% to near 100%.
No vanishing gradient: attention creates direct connections between distant positions — gradient highway at every layer.
Scalability: the architecture trains efficiently on modern hardware, making very large models feasible.

The paper also introduced scaled dot-product attention — simpler than Bahdanau's additive formulation, and faster to compute on GPUs. Three weight matrices (Q, K, V) replace the alignment network. The scaling by 1/√dₖ keeps the dot products well-behaved in high dimensions. The reason Bahdanau used an additive form (v·tanh(Ws·s + Wh·h)) rather than a dot product is that the decoder state s and encoder state h live in different vector spaces with potentially different dimensions — you cannot take a meaningful dot product between them. The Transformer avoids this by projecting queries and keys into the same dₖ-dimensional space first, making the dot product both valid and efficient.

The evolutionary path — click to compare:

compare.js — RNN vs Transformer: operation counts JS

// Comparing RNN and Transformer on key metrics.
// Shows why the Transformer became the dominant architecture.

function rnnComplexity(n, d) {
  return {
    sequentialOps: n,            // must process tokens 1..n in order
    parallelOps: 1,              // only one token at a time
    pathLength: n,               // to connect position 1 to position n
    params: 4 * d * d,           // W_hh, W_xh, W_hy, W_ih
    memoryPerToken: d,           // one hidden state vector
  };
}

function transformerComplexity(n, d, h) {
  return {
    sequentialOps: 1,            // all positions computed in parallel
    parallelOps: n,              // every token processed simultaneously
    pathLength: 1,               // direct attention between any positions
    params: 4 * d * d,           // W_q, W_k, W_v, W_o (approximately)
    memoryPerToken: n * d,       // must store all K,V pairs
  };
}

const configs = [
  { n:  64, d: 256, h: 8, label: 'Small (n=64,  d=256)' },
  { n: 512, d: 512, h: 8, label: 'Medium(n=512, d=512)' },
  { n:2048, d:1024, h:16, label: 'Large (n=2048,d=1024)'},
];

console.log('RNN vs Transformer — key complexity metrics:\n');
for (const {n, d, h, label} of configs) {
  const rnn = rnnComplexity(n, d);
  const tfm = transformerComplexity(n, d, h);
  console.log(`=== ${label} ===`);
  console.log(`  Sequential operations:  RNN=${rnn.sequentialOps}  Transformer=${tfm.sequentialOps}`);
  console.log(`  Parallel ops possible:  RNN=${rnn.parallelOps}     Transformer=${tfm.parallelOps}`);
  console.log(`  Max path length:        RNN=${rnn.pathLength}     Transformer=${tfm.pathLength}`);
  const speedup = rnn.sequentialOps / tfm.sequentialOps;
  console.log(`  Parallelism speedup:    ${speedup}×`);
  console.log();
}

console.log('The Transformer reduces sequential operations from O(n) to O(1).');
console.log('This single change made training on modern GPUs orders of magnitude faster.');
console.log('It is why scaling to 100B+ parameters became feasible.');

The Transformer was not the end of the story. GPT-1 (2018) showed that pretraining a Transformer on large corpora transferred powerfully to downstream tasks. BERT (2018) showed that bidirectional pretraining was better for understanding tasks. GPT-3 (2020) showed that scale alone produced capabilities that seemed qualitatively new. The details continued to evolve — positional encodings, normalization placement, activation functions — but the fundamental architecture of 2017 persists at the core of every large language model in production today.

Where Part 1 begins: the Transformer architecture — tokens, embeddings, multi-head attention, feedforward layers, the full forward pass — is covered in detail in Part 1: Attention & the Transformer. Part 0 gave you the "why". Part 1 gives you the "how".

→ Part 1: Attention & the Transformer — embeddings, multi-head attention, positional encoding, the full forward pass.

The road toTransformers

1948Shannon & the statistics of language

1950sN-gram models first working LM

1958The Perceptron and its limits

1986Backpropagation & the revival key unlock

1990sRecurrent networks & sequences

1991The vanishing gradient problem

1997LSTMs: memory with gates solves vanishing gradient

2013Word embeddings & meaning as geometry semantic geometry

2014Sequence-to-sequence & the bottleneck

2015Attention: a soft, learnable lookup the real breakthrough

2017Removing recurrence entirely attention is all you need

The road to
Transformers