Part Seven · Chapter 22 · 2 of 3

Attention & the Transformer

How text becomes numbers, numbers attend to one another, and a stack of simple matrix operations learns to model language.

Contents

§01 What is a language model?
§02 Tokenization & vocabulary
§03 Embeddings
§04 Matrix operations from scratch
§05 The attention mechanism
§06 Multi-head attention
§07 Positional encoding
§08 The Transformer block
§09 Stacking layers
§10 The full forward pass

§01

What is a language model?

Strip away everything — the chat interface, the RLHF tuning (an alignment step covered in Part 2), the trillion parameters — and a language model is a single function: given a sequence of tokens, produce a probability distribution over the vocabulary for the next token.

That's it. P(next token | all previous tokens). The remarkable fact is that training this function on enough text, at enough scale, causes it to develop internal representations that look like knowledge, reasoning, and language understanding — as emergent side-effects.
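This one-step prediction is enough to score or generate whole sequences, because the chain rule of probability factors any sequence into a product of next-token predictions. In the sketch below, x_t denotes the token at position t and n the sequence length (notation introduced here for illustration):

```latex
P(x_1, x_2, \dots, x_n) \;=\; \prod_{t=1}^{n} P(x_t \mid x_1, \dots, x_{t-1})
```

Generation runs the same factorization forwards: sample x_1, then x_2 given x_1, and so on.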

Before we can understand how that function is computed, we need to understand how text becomes numbers, because neural networks operate entirely in the realm of vectors and matrices.

Interactive — predict the next token

A tiny bigram model trained on character frequencies. Not a real LLM — but the same predict-the-next-token loop that underlies GPT-4.

bigram.js — a minimal language model JS

// A bigram character model: P(next_char | prev_char)
// This is the simplest possible "language model" —
// each character prediction depends only on the one before it.

function buildBigram(text) {
  const counts = {};
  for (let i = 0; i < text.length - 1; i++) {
    const key = text[i];
    if (!counts[key]) counts[key] = {};
    counts[key][text[i+1]] = (counts[key][text[i+1]] || 0) + 1;
  }
  // convert counts to probabilities
  const model = {};
  for (const ch in counts) {
    const total = Object.values(counts[ch]).reduce((a,b) => a+b, 0);
    model[ch] = {};
    for (const next in counts[ch])
      model[ch][next] = counts[ch][next] / total;
  }
  return model;
}

function sample(dist) {
  const r = Math.random();
  let cumul = 0;
  for (const [ch, p] of Object.entries(dist)) {
    cumul += p;
    if (r < cumul) return ch;
  }
  return Object.keys(dist)[0];
}

function generate(model, seed, length=80) {
  let out = seed;
  let ch = seed[seed.length - 1];
  for (let i = 0; i < length; i++) {
    if (!model[ch]) break;
    ch = sample(model[ch]);
    out += ch;
  }
  return out;
}

const text = `the cat sat on the mat the cat ate the rat the rat ran from the cat`;
const model = buildBigram(text);
console.log("Bigram model trained. Top predictions after 't':");
const afterT = Object.entries(model['t'] || {})
  .sort((a,b) => b[1]-a[1]).slice(0,5);
afterT.forEach(([ch, p]) => console.log(`  P(${JSON.stringify(ch)} | 't') = ${p.toFixed(3)}`));
console.log("\nGenerated text:");
console.log(generate(model, "the ", 100));

This bigram model captures almost nothing of real language. A real LLM looks at the entire context — hundreds of thousands of tokens — and its predictions involve hundreds of billions of parameters. But the interface is identical: a probability distribution over the vocabulary.

§02

Tokenization & vocabulary

Before text can enter a model, it must become a sequence of integers. A tokenizer splits text into tokens — sub-word units — and maps each to an ID in a fixed vocabulary.

Why not characters? A character-level model must work too hard: the string "unbelievable" is 12 steps for a character model but only one or a few tokens for a well-trained sub-word tokenizer. Why not words? Vocabulary would explode to millions of entries across all languages, inflections, and compounds, and unseen words become impossible to handle.

The standard approach is Byte-Pair Encoding (BPE): start with individual bytes, then repeatedly merge the most frequent adjacent pair into a new token. Run this 50,000 times and you get a vocabulary that covers common words whole and breaks rare words into recognizable sub-word pieces.

bpe.js — Byte-Pair Encoding from scratch JS

// BPE tokenizer, trained on a tiny corpus.
// Production tokenizers (e.g. GPT-2's) run the same basic algorithm over bytes on a far larger corpus.

function countPairs(corpus) {
  // Count every adjacent token pair across the corpus
  const freq = {};
  for (const word of corpus) {
    const tokens = word.split(' ');
    for (let i = 0; i < tokens.length - 1; i++) {
      const pair = tokens[i] + '|' + tokens[i+1];
      freq[pair] = (freq[pair] || 0) + 1;
    }
  }
  return freq;
}

function mergePair(corpus, pair) {
  // Replace every adjacent occurrence of (a, b) with the merged token a+b
  const [a, b] = pair.split('|');
  return corpus.map(word => {
    const toks = word.split(' ');
    const out = [];
    for (let i = 0; i < toks.length; i++) {
      if (toks[i] === a && toks[i+1] === b) {
        out.push(a + b);
        i++;                 // skip the right-hand token we just merged
      } else {
        out.push(toks[i]);
      }
    }
    return out.join(' ');
  });
}

function trainBPE(words, numMerges) {
  // Start: every character is its own token, separated by spaces.
  // (No trailing space: it would create an empty token that then gets merged.)
  let corpus = words.map(w => w.split('').join(' '));
  const merges = [];
  for (let i = 0; i < numMerges; i++) {
    const freq = countPairs(corpus);
    if (Object.keys(freq).length === 0) break;
    const best = Object.entries(freq).sort((a,b) => b[1]-a[1])[0][0];
    corpus = mergePair(corpus, best);
    merges.push(best.replace('|', ' + '));
  }
  return { corpus, merges };
}

const words = ['low','lower','newest','widest','low','low','lower','newer'];
const { corpus, merges } = trainBPE(words, 8);
console.log('Merge operations learned:');
merges.forEach((m, i) => console.log(`  ${i+1}. merge: ${m}`));
console.log('\nFinal tokenized corpus:');
corpus.forEach((w, i) => console.log(`  "${words[i]}" → [${w}]`));
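
Training yields an ordered list of merges; tokenizing new text means replaying those merges in the same order. The following is a minimal sketch of that encoding step, assuming merges are stored as [left, right] pairs (a format chosen here for illustration). The three merges are hand-picked examples, not the output of the run above.

```javascript
// Apply learned BPE merges, in the order they were learned, to a new word.
// `merges` is a list of [left, right] pairs (hypothetical storage format).
function encodeWord(word, merges) {
  let tokens = word.split('');           // start from single characters
  for (const [a, b] of merges) {
    const out = [];
    for (let i = 0; i < tokens.length; i++) {
      if (tokens[i] === a && tokens[i + 1] === b) {
        out.push(a + b);                 // merge the pair
        i++;                             // and skip its right half
      } else {
        out.push(tokens[i]);
      }
    }
    tokens = out;
  }
  return tokens;
}

// Hand-picked merges for illustration only
const exampleMerges = [['l', 'o'], ['lo', 'w'], ['e', 'r']];
console.log(encodeWord('lower', exampleMerges));   // [ 'low', 'er' ]
console.log(encodeWord('lowest', exampleMerges));  // [ 'low', 'e', 's', 't' ]
```

A word never seen in training ("lowest") still tokenizes: the known prefix becomes one token and the rest falls back to smaller pieces.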

GPT-4 uses a vocabulary of ~100,000 tokens. The tokenizer is trained once, frozen, and shared between training and inference. A critical practical consequence: the number of tokens in a string depends heavily on the language — English is very efficient; programming languages and non-Latin scripts are less so.
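
One contributing factor can be checked directly: byte-level BPE starts from UTF-8 bytes, and non-Latin scripts need more bytes per character before any merge is applied. A small sketch using the standard TextEncoder API (available in browsers and Node.js); the sample strings are arbitrary choices:

```javascript
// Count the UTF-8 bytes a byte-level tokenizer would start from.
const utf8Encoder = new TextEncoder();

function utf8Bytes(str) {
  return utf8Encoder.encode(str).length;
}

for (const s of ['hello', 'привет', 'こんにちは']) {
  // [...s] splits by code point, so it counts characters rather than UTF-16 units
  console.log(`${JSON.stringify(s)}: ${[...s].length} chars, ${utf8Bytes(s)} bytes`);
}
// "hello": 5 chars, 5 bytes
// "привет": 6 chars, 12 bytes
// "こんにちは": 5 chars, 15 bytes
```

The byte count is only the starting length. How far it shrinks depends on which merges the tokenizer learned, and those are dominated by whatever text was most common in its training corpus.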

§03

Embeddings

A token ID is just an integer — a lookup index. It carries no information about meaning. Embeddings convert each token ID into a dense vector of real numbers, typically 768 to 12,288 dimensions depending on model size.

The embedding table is a matrix E of shape [vocab_size × d_model]. To embed token 42, you simply retrieve row 42 of E. These values start random and are learned during training to encode semantic relationships.
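
The lookup can also be read as a matrix product, which is why it trains like any other layer: multiplying a one-hot vector by E selects exactly one row. A minimal sketch with a made-up 3-token, 2-dimensional table (the values and helper names are illustrative assumptions):

```javascript
// Hypothetical 3-token vocabulary with 2-dimensional embeddings
const tinyE = [
  [ 0.9, 0.1],
  [ 0.2, 0.8],
  [-0.3, 0.9],
];

// One-hot vector of length `size` with a 1 at position `id`
function oneHot(id, size) {
  return Array.from({ length: size }, (_, i) => (i === id ? 1 : 0));
}

// Row vector times matrix: v[1×m] · M[m×n] → [1×n]
function vecMat(v, M) {
  return M[0].map((_, j) => v.reduce((s, x, i) => s + x * M[i][j], 0));
}

const viaLookup = tinyE[1];                                  // direct row access
const viaMatmul = vecMat(oneHot(1, tinyE.length), tinyE);    // one-hot · E
console.log(viaLookup, viaMatmul);                           // both [ 0.2, 0.8 ]
```

Real implementations use the row lookup because it avoids multiplying by a vector that is almost entirely zeros.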

The remarkable property that emerges: similar meanings end up nearby in this high-dimensional space. The famous result king − man + woman ≈ queen is not programmed in; it emerges from the geometry of embeddings trained with a shallow word-prediction objective (word2vec). In LLMs, the embedding table is trained end-to-end alongside all other parameters, so the geometry is richer and less cleanly arithmetic — but the core idea holds: proximity in embedding space reflects semantic similarity.

embeddings.js — lookup table and cosine similarity JS

// Embedding table: integer token IDs → dense vectors.

function makeMatrix(rows, cols, initFn) {
  return Array.from({length: rows}, (_, i) =>
    Array.from({length: cols}, (_, j) => initFn(i, j))
  );
}

// Tiny example: 8-token vocabulary, 4-dimensional embeddings
const VOCAB_SIZE = 8;
const D_MODEL    = 4;

// In reality these are learned. Here we use hand-crafted values
// to demonstrate the geometry.
const embeddingTable = [
  // token:  [dim0,   dim1,  dim2,  dim3]
  /* 0 the */  [ 0.9,  0.1, -0.2,  0.0],
  /* 1 cat */  [ 0.2,  0.8,  0.7, -0.1],
  /* 2 dog */  [ 0.2,  0.7,  0.8,  0.1],
  /* 3 sat */  [-0.3,  0.1,  0.2,  0.9],
  /* 4 ran */  [-0.4,  0.1,  0.1,  0.8],
  /* 5 on  */  [ 0.8,  0.0, -0.3,  0.1],
  /* 6 mat */  [ 0.3, -0.1, -0.2,  0.7],
  /* 7 fast*/  [-0.5,  0.2,  0.0,  0.6],
];
const vocab = ['the','cat','dog','sat','ran','on','mat','fast'];

function embed(tokenId) {
  return embeddingTable[tokenId];
}

function dot(a, b) {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function norm(v) {
  return Math.sqrt(v.reduce((s, x) => s + x*x, 0));
}

function cosineSim(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

// Compare pairs
const pairs = [
  ['cat', 'dog'],   // semantically similar (both animals)
  ['sat', 'ran'],   // semantically similar (both verbs)
  ['cat', 'sat'],   // different POS
  ['the', 'on'],    // both function words
];

console.log('Cosine similarity between token embeddings:');
console.log('(higher = more similar in embedding space)\n');
for (const [a, b] of pairs) {
  const ia = vocab.indexOf(a), ib = vocab.indexOf(b);
  const sim = cosineSim(embed(ia), embed(ib));
  console.log(`  sim("${a}", "${b}") = ${sim.toFixed(4)}`);
}

// Embed a sentence
const sentence = 'the cat sat';
const tokens = sentence.split(' ').map(w => vocab.indexOf(w));
console.log('\nEmbedding "the cat sat":');
tokens.forEach(id => {
  console.log(`  token ${id} ("${vocab[id]}") → [${embed(id).map(x=>x.toFixed(2)).join(', ')}]`);
});

Key point: After embedding, a sentence of n tokens becomes a matrix of shape [n × d_model]. All subsequent transformer operations work on this matrix. The model never sees characters or words — only these vectors.

§04

Matrix operations from scratch

The Transformer is, at its core, a sequence of matrix multiplications, additions, and nonlinearities. Before we can understand attention, we need these primitives in place.

Here is our complete linear algebra library — 30 lines of JavaScript that are sufficient to implement the entire forward pass:

linalg.js — the full math library (30 lines) JS

// Linear algebra.
// This 30-line file is sufficient for the entire Transformer forward pass.

const LA = {
  // Create matrix filled with zeros
  zeros: (r, c) => Array.from({length:r}, () => new Array(c).fill(0)),

  // Create matrix with Gaussian random values (mean 0, standard deviation = scale)
  randn: (r, c, scale=0.02) => Array.from({length:r}, () =>
    Array.from({length:c}, () => {
      // Box-Muller transform for Gaussian random
      const u = 1 - Math.random(), v = Math.random();
      return Math.sqrt(-2*Math.log(u)) * Math.cos(2*Math.PI*v) * scale;
    })
  ),

  // Matrix multiply: A[m×k] × B[k×n] → C[m×n]
  matmul: (A, B) => {
    const [m,k,n] = [A.length, B.length, B[0].length];
    const C = LA.zeros(m, n);
    for (let i=0;i<m;i++) for (let j=0;j<n;j++)
      for (let p=0;p<k;p++) C[i][j] += A[i][p] * B[p][j];
    return C;
  },

  // Transpose: A[m×n] → Aᵀ[n×m]
  T: A => A[0].map((_, j) => A.map(row => row[j])),

  // Elementwise add (broadcast if b is 1D vector)
  add: (A, b) => A.map((row, i) =>
    row.map((v, j) => v + (Array.isArray(b[0]) ? b[i][j] : b[j]))
  ),

  // Scale every element
  scale: (A, s) => A.map(row => row.map(v => v * s)),

  // Softmax along last axis (each row independently)
  softmax: A => A.map(row => {
    const max = Math.max(...row); // numerical stability
    const ex = row.map(x => Math.exp(x - max));
    const s  = ex.reduce((a,b) => a+b, 0);
    return ex.map(x => x/s);
  }),

  // Layer norm: normalize each row to mean=0, std=1, then scale+shift
  layernorm: (A, gamma, beta, eps=1e-5) => A.map(row => {
    const mean = row.reduce((a,b)=>a+b,0) / row.length;
    const variance = row.reduce((a,b)=>a+(b-mean)**2,0) / row.length;
    const std = Math.sqrt(variance + eps);
    return row.map((x,i) => gamma[i] * (x-mean)/std + beta[i]);
  }),

  // GELU activation (approximate)
  gelu: A => A.map(row =>
    row.map(x => 0.5 * x * (1 + Math.tanh(Math.sqrt(2/Math.PI)*(x+0.044715*x**3))))
  ),
};

// ── Tests ──────────────────────────────────────────────────────────
const A = [[1,2],[3,4]];
const B = [[5,6],[7,8]];
console.log('A × B =', LA.matmul(A, B));
// Expected: [[19,22],[43,50]]

console.log('Aᵀ =', LA.T(A));
// Expected: [[1,3],[2,4]]

console.log('softmax([[1,2,3]]) =',
  LA.softmax([[1,2,3]]).map(row => row.map(x=>x.toFixed(4))));
// ~[0.09, 0.24, 0.67]

const gamma = [1,1,1], beta = [0,0,0];
console.log('layernorm([[1,2,3]]) =',
  LA.layernorm([[1,2,3]], gamma, beta).map(r=>r.map(x=>x.toFixed(4))));

From here on, LA is our complete dependency. Every operation in the Transformer — attention, feedforward, layer norm — will be built from these primitives.

§05

The attention mechanism

Attention is the mechanism that lets every token look at every other token and decide how much to "attend" to each. It is the core innovation of the Transformer — the thing that replaced recurrence and made parallel training over long contexts possible.

Queries, Keys, and Values

For each token in the sequence, attention computes three vectors by multiplying the embedding by learned weight matrices:

Query (Q) — "what am I looking for?"
Key (K) — "what do I contain?"
Value (V) — "what do I give, if attended to?"

The attention score between token i (querying) and token j (being attended to) is the dot product of their query and key vectors: score(i,j) = Q[i] · K[j]. The dot product is the right primitive here because it measures geometric alignment: for vectors of comparable length, a large dot product means they point in similar directions. After training, a query and a key that are compatible end up pointing in similar directions, so the dot product naturally becomes high for relevant pairs and low for irrelevant ones. The output for token i is then the weighted sum of all value vectors, with weights given by the softmax of row i of the scores.

The full formula for one attention head:

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

The √d_k term is a scaling factor to prevent the dot products from growing too large in high-dimensional spaces, which would push softmax into a saturated, near-zero-gradient region. Here is why high-dimensional dot products tend to be large: if Q and K are independent vectors whose components each have mean 0 and variance 1, then Q·K = Σ QᵢKᵢ is a sum of d_k independent terms each with variance 1, so the total dot product has variance d_k and its typical size grows like √d_k. Dividing by √d_k standardises the variance back to 1 regardless of how large d_k gets, keeping the softmax inputs in a range where the output is spread across many tokens rather than collapsing to a near-one-hot spike.
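
The variance argument is easy to check numerically. The sketch below draws random unit-variance vectors and measures the variance of their dot product; the function names and trial count are choices made for this illustration, and the printed values vary slightly from run to run.

```javascript
// Standard normal sample via the Box-Muller transform
function gaussian() {
  const u = 1 - Math.random(), v = Math.random();   // u in (0, 1], so log(u) is finite
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Empirical variance of q·k for random q, k with independent N(0, 1) components
function dotVariance(dk, trials = 20000) {
  let sum = 0, sumSq = 0;
  for (let t = 0; t < trials; t++) {
    let d = 0;
    for (let i = 0; i < dk; i++) d += gaussian() * gaussian();
    sum += d;
    sumSq += d * d;
  }
  const mean = sum / trials;
  return sumSq / trials - mean * mean;
}

for (const dk of [4, 64, 256]) {
  const v = dotVariance(dk);
  // Dividing the scores by √d_k divides their variance by d_k
  console.log(`d_k=${dk}: var(q·k) ≈ ${v.toFixed(1)}, after scaling ≈ ${(v / dk).toFixed(2)}`);
}
```

The unscaled variance should track d_k closely, while the scaled value stays near 1 at every dimension.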

attention.js — scaled dot-product attention JS

// Scaled dot-product attention: the heart of every Transformer.
// Inputs Q, K: [seq_len × d_k] and V: [seq_len × d_v], obtained by projecting
//   X [seq_len × d_model] with learned matrices Wq, Wk [d_model × d_k] and Wv [d_model × d_v]
// causal: if true, apply lower-triangular mask (GPT/decoder style)
// Returns: { output [seq_len × d_v], weights [seq_len × seq_len] }

function scaledDotProductAttention(Q, K, V, causal = false) {
  const dk = K[0].length;
  const seq_len = Q.length;

  // Step 1: compute raw scores — Q · Kᵀ
  // scores[i][j] = "how much should token i attend to token j?"
  const scores = LA.matmul(Q, LA.T(K));   // [seq_len × seq_len]

  // Step 2: scale by 1/√dk — keeps variance stable across dimensions
  const scaled = LA.scale(scores, 1 / Math.sqrt(dk));

  // Step 3: causal mask — tokens cannot attend to future tokens.
  // Set future positions to -Infinity so softmax gives them weight ≈ 0.
  // This is what makes GPT "autoregressive": at generation time token i
  // can only see tokens 0…i, and training must match that constraint.
  if (causal) {
    for (let i = 0; i < seq_len; i++)
      for (let j = i + 1; j < seq_len; j++)
        scaled[i][j] = -Infinity;
  }

  // Step 4: softmax over each row — convert scores to probabilities
  const weights = LA.softmax(scaled);     // [seq_len × seq_len]

  // Step 5: weighted sum of values
  const output = LA.matmul(weights, V);   // [seq_len × d_v]

  return { output, weights };
}

// ── Example: 4-token sequence, d_model=4, d_k=3 ───────────────────
const seq_len=4, d_model=4, d_k=3;

// Simulated embeddings for "the cat sat on"
const X = [
  [ 0.9, 0.1, -0.2,  0.0],  // the
  [ 0.2, 0.8,  0.7, -0.1],  // cat
  [-0.3, 0.1,  0.2,  0.9],  // sat
  [ 0.8, 0.0, -0.3,  0.1],  // on
];

// Random projection weight matrices (would be learned during training)
const Wq = [[ 0.4,-0.2, 0.7],[-0.1, 0.5, 0.3],[ 0.2,-0.4, 0.6],[ 0.1, 0.3,-0.2]];
const Wk = [[ 0.3, 0.6,-0.1],[ 0.5,-0.2, 0.4],[-0.3, 0.1, 0.7],[ 0.2, 0.4,-0.3]];
const Wv = [[ 0.6, 0.1, 0.3,-0.2],[ 0.2, 0.7,-0.1, 0.4],[-0.4, 0.3, 0.5, 0.1],[ 0.1,-0.2, 0.4, 0.6]];

const Q = LA.matmul(X, Wq);
const K = LA.matmul(X, Wk);
const V = LA.matmul(X, Wv);

const tokens = ['the', 'cat', 'sat', 'on'];

// ── Bidirectional (BERT-style encoder): every token sees every other ─
const { output, weights } = scaledDotProductAttention(Q, K, V, false);
console.log('Bidirectional attention weights (encoder / BERT style):');
console.log('         ' + tokens.map(t => t.padStart(7)).join(''));
weights.forEach((row, i) => {
  console.log(`  ${tokens[i].padEnd(6)} ${row.map(w => w.toFixed(3).padStart(7)).join('')}`);
});

// ── Causal (GPT-style decoder): each token sees only past and itself ─
const { weights: cw } = scaledDotProductAttention(Q, K, V, true);
console.log('\nCausal attention weights (decoder / GPT style — lower-triangular):');
console.log('         ' + tokens.map(t => t.padStart(7)).join(''));
cw.forEach((row, i) => {
  console.log(`  ${tokens[i].padEnd(6)} ${row.map(w => isNaN(w) ? '      0' : w.toFixed(3).padStart(7)).join('')}`);
});
console.log('\n"the" attends only to itself. "cat" can see "the" and "cat".');
console.log('"on" can see all four tokens. No token sees into the future.');

Notice that the output vectors are no longer simple embeddings — each one is a contextualized representation that has absorbed information from all other tokens in proportion to the attention weights. "cat" now carries a hint of "sat" and "the". This is what makes Transformers so powerful: every representation is context-aware from the very first layer.

The two weight tables above reveal the two main Transformer variants. Bidirectional attention (BERT-style encoders) lets every token attend to every other — useful when you need a rich representation of a complete input. Causal attention (GPT-style decoders) enforces a strict left-to-right constraint: token i can only see tokens 0…i. This is not just a training convenience — it is architecturally required. At generation time the model produces one token at a time, appending each to the sequence before computing the next. Training with the causal mask ensures the model never leaks information from the future into its predictions, exactly matching how it will be used.
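
The no-leak property can be tested directly: with a causal mask, the outputs for the first tokens must not change when more tokens are appended. Below is a self-contained sketch under a simplifying assumption, namely that Q = K = V = X with no learned projections; the input values are arbitrary.

```javascript
function softmaxRow(row) {
  const max = Math.max(...row);
  const ex = row.map(x => Math.exp(x - max));   // exp(-Infinity) is exactly 0
  const total = ex.reduce((a, b) => a + b, 0);
  return ex.map(x => x / total);
}

// Causal self-attention with Q = K = V = X (simplification for this sketch)
function causalSelfAttention(X) {
  const dk = X[0].length;
  return X.map((q, i) => {
    const scores = X.map((k, j) =>
      j > i ? -Infinity                          // mask out future positions
            : q.reduce((s, v, t) => s + v * k[t], 0) / Math.sqrt(dk));
    const w = softmaxRow(scores);
    return X[0].map((_, t) => w.reduce((s, wj, j) => s + wj * X[j][t], 0));
  });
}

const fullSeq = [[0.9, 0.1], [0.2, 0.8], [-0.3, 0.9], [0.8, 0.0]];
const outFull   = causalSelfAttention(fullSeq);              // all 4 tokens
const outPrefix = causalSelfAttention(fullSeq.slice(0, 2));  // first 2 tokens only

console.log('token 1, full sequence:', outFull[1]);
console.log('token 1, prefix only:  ', outPrefix[1]);        // same values
```

This is also why generation can append one token at a time without invalidating what was computed for earlier positions.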

§06

Multi-head attention

A single attention head can only learn one type of relationship at a time. Multi-head attention runs h independent attention heads in parallel, each with its own projection matrices, then concatenates their outputs and projects back to d_model.

Why? Different heads specialize. In practice, heads learn to capture different aspects of language: one may track syntactic subject–verb agreement, another co-reference, another positional proximity. This specialization is not programmed — it emerges from training.

Parameter	GPT-2 small	GPT-3 175B
`d_model`	768	12,288
`n_heads`	12	96
`d_k = d_model/n_heads`	64	128
Parameters in MHA (per layer)	~2.4M	~604M
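
The last row of the table follows from the shapes alone. A quick sketch of the arithmetic, assuming d_v = d_k and ignoring bias terms (the function name is made up for this example):

```javascript
// Weight count for one multi-head attention layer, biases ignored.
// Wq, Wk, Wv: nHeads matrices of shape [dModel × dK] each; Wo: [nHeads*dK × dModel].
function mhaParams(dModel, nHeads) {
  const dK = dModel / nHeads;
  const qkv = 3 * nHeads * dModel * dK;   // all per-head projections together
  const wo  = nHeads * dK * dModel;       // output projection
  return qkv + wo;                        // simplifies to 4 * dModel^2
}

console.log(mhaParams(768, 12));      // 2359296   (about 2.4 million)
console.log(mhaParams(12288, 96));    // 603979776 (about 0.6 billion)
console.log(mhaParams(768, 12) === mhaParams(768, 4));  // true
```

The head count cancels out: splitting d_model across more heads changes how the weights are partitioned, not how many there are.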

multihead.js — multi-head attention JS

// Multi-head attention: h parallel attention heads, then concat + project.

function multiHeadAttention(X, Wo, heads) {
  // heads: array of {Wq, Wk, Wv} — one per head
  const headOutputs = heads.map(h => {
    const Q = LA.matmul(X, h.Wq);
    const K = LA.matmul(X, h.Wk);
    const V = LA.matmul(X, h.Wv);
    return scaledDotProductAttention(Q, K, V).output;
  });

  // Concatenate along the feature dimension: [seq_len × (h * d_v)]
  const concat = headOutputs[0].map((_, i) =>
    headOutputs.flatMap(h => h[i])
  );

  // Final projection Wo: [h*d_v × d_model] — mixes head outputs back into d_model
  // Each head saw a different projection; Wo learns how to blend their contributions.
  return LA.matmul(concat, Wo);
}

// ── Example: 2-head attention on a 4-token sequence ───────────────
function scaledDotProductAttention(Q, K, V, causal = false) {
  const dk = K[0].length;
  const seq_len = Q.length;
  const scores = LA.matmul(Q, LA.T(K));
  const scaled = LA.scale(scores, 1 / Math.sqrt(dk));
  if (causal)
    for (let i = 0; i < seq_len; i++)
      for (let j = i + 1; j < seq_len; j++)
        scaled[i][j] = -Infinity;
  const weights = LA.softmax(scaled);
  return { output: LA.matmul(weights, V), weights };
}

const X2 = [
  [ 0.9, 0.1, -0.2,  0.0],  // the
  [ 0.2, 0.8,  0.7, -0.1],  // cat
  [-0.3, 0.1,  0.2,  0.9],  // sat
  [ 0.8, 0.0, -0.3,  0.1],  // on
];

// 2 heads, d_k = d_v = 2 (d_model/n_heads = 4/2)
const heads = [
  {
    Wq: [[.6,-.2],[.1,.5],[.3,.2],[-.1,.4]],
    Wk: [[.5,.3],[-.2,.6],[.4,-.1],[.2,.3]],
    Wv: [[.4,.2],[.3,-.1],[.1,.5],[-.2,.3]],
  },
  {
    Wq: [[-.3,.5],[.4,.1],[.2,-.4],[.6,.2]],
    Wk: [[.2,-.3],[.5,.4],[-.1,.2],[.3,.6]],
    Wv: [[.3,.5],[-.2,.4],[.6,.1],[.2,-.3]],
  }
];

// Wo: [h*d_v × d_model] = [4 × 4] — blends the two heads back into d_model
const Wo = [
  [ 0.4, 0.3,-0.2, 0.1],
  [-0.1, 0.5, 0.3,-0.2],
  [ 0.2,-0.1, 0.6, 0.4],
  [ 0.3, 0.2,-0.3, 0.5],
];

const out = multiHeadAttention(X2, Wo, heads);
const tokens = ['the', 'cat', 'sat', 'on'];
console.log('Multi-head attention output (2 heads, d_model=4, with Wo projection):');
out.forEach((row, i) => {
  console.log(`  ${tokens[i].padEnd(5)} [${row.map(v=>v.toFixed(3)).join(', ')}]`);
});

The output has the same shape as the input — [seq_len × d_model] — because Wo maps the concatenated head outputs back to d_model dimensions. Each head independently extracted a different type of relationship; Wo learns a weighted blend of those findings. The shape-preserving property is what makes it straightforward to stack multiple blocks on top of each other.

§07

Positional encoding

Attention on its own has no notion of token order. Strictly speaking it is permutation-equivariant: shuffling the input tokens simply shuffles the output vectors in the same way. Without something extra, "cat" would receive exactly the same output vector in "the cat sat" as in "sat the cat". That something is positional encoding.
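
This order-blindness can be demonstrated in a few lines. The sketch below runs unmasked self-attention on a sequence and on a shuffled copy, then compares each token's output. It assumes Q = K = V = X with no learned projections and no positional information; the input values are arbitrary.

```javascript
function softmaxVec(v) {
  const max = Math.max(...v);
  const ex = v.map(x => Math.exp(x - max));
  const total = ex.reduce((a, b) => a + b, 0);
  return ex.map(x => x / total);
}

// Unmasked self-attention with Q = K = V = X (simplification for this sketch)
function plainSelfAttention(X) {
  const dk = X[0].length;
  return X.map(q => {
    const scores = X.map(k => q.reduce((s, v, t) => s + v * k[t], 0) / Math.sqrt(dk));
    const w = softmaxVec(scores);
    return X[0].map((_, t) => w.reduce((s, wj, j) => s + wj * X[j][t], 0));
  });
}

const original = [[0.9, 0.1], [0.2, 0.8], [-0.3, 0.9]];
const order    = [2, 0, 1];                       // new position r holds old token order[r]
const shuffled = order.map(i => original[i]);

const outOriginal = plainSelfAttention(original);
const outShuffled = plainSelfAttention(shuffled);

// Compare each shuffled output with the output of the same token in the original order
const maxDiff = Math.max(...order.flatMap((src, r) =>
  outShuffled[r].map((v, t) => Math.abs(v - outOriginal[src][t]))));
console.log('largest difference after un-shuffling:', maxDiff);  // ~0 (rounding only)
```

Each token gets the same vector wherever it sits, so word order carries no signal until position information is added to the inputs.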

The original Transformer uses sinusoidal encodings: each position gets a unique vector constructed from sine and cosine functions at different frequencies. This is added directly to the token embeddings before the first layer.

Modern models like GPT-2+ use learned positional embeddings (another lookup table, one row per position) or more sophisticated schemes like RoPE (Rotary Position Embedding). The core idea remains: inject position information into the representation before attention operates on it.

positional.js — sinusoidal encoding JS

// Sinusoidal positional encoding from "Attention Is All You Need" (Vaswani 2017).
// PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
// PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

function sinusoidalPE(seqLen, dModel) {
  return Array.from({length: seqLen}, (_, pos) =>
    Array.from({length: dModel}, (_, i) => {
      const dim = Math.floor(i / 2);
      const angle = pos / Math.pow(10000, 2 * dim / dModel);
      return i % 2 === 0 ? Math.sin(angle) : Math.cos(angle);
    })
  );
}

const seqLen = 6, dModel = 8;
const PE = sinusoidalPE(seqLen, dModel);

console.log(`Positional encodings [${seqLen} positions × ${dModel} dims]:`);
console.log('pos  ' + Array.from({length:dModel},(_,i)=>`d${i}`.padStart(7)).join(''));
PE.forEach((row, pos) => {
  console.log(`  ${pos}  ${row.map(v => v.toFixed(3).padStart(7)).join('')}`);
});

// Key property: positions that are close together have similar encodings
function cosineSim(a, b) {
  const dot = a.reduce((s,v,i)=>s+v*b[i],0);
  const na  = Math.sqrt(a.reduce((s,v)=>s+v*v,0));
  const nb  = Math.sqrt(b.reduce((s,v)=>s+v*v,0));
  return dot/(na*nb);
}
console.log('\nCosine similarity between positional encodings:');
console.log(`  sim(pos0, pos1) = ${cosineSim(PE[0],PE[1]).toFixed(4)}  (adjacent)`);
console.log(`  sim(pos0, pos3) = ${cosineSim(PE[0],PE[3]).toFixed(4)}  (3 apart)`);
console.log(`  sim(pos0, pos5) = ${cosineSim(PE[0],PE[5]).toFixed(4)}  (5 apart)`);

§08

The Transformer block

One Transformer block consists of two sublayers, each wrapped in a residual connection and layer normalization:

Multi-head self-attention — tokens exchange information
Position-wise feedforward network — each token independently processed

The residual connection adds the input back to the sublayer output: output = x + Sublayer(x). This solves two critical problems: it allows gradients to flow cleanly during backpropagation (the additive path carries gradients directly, bypassing the sublayer's non-linearities), and it lets early layers still contribute to the final output even in very deep networks. The original 2017 Transformer paper applied layer normalisation after the addition (post-norm: LayerNorm(x + Sublayer(x))). GPT-2 and most subsequent models moved it before the sublayer (pre-norm: x + Sublayer(LayerNorm(x))). Pre-norm tends to train more stably at large scale because the residual stream itself is never normalised, so an unobstructed identity path runs from the input to the output of the stack.
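
The two orderings differ only in where the normalisation sits. A minimal sketch on a single vector, using a made-up stand-in for the sublayer (`toySublayer` is hypothetical, and the layer norm here omits the learned scale and shift):

```javascript
// Layer norm without learned gamma/beta, for illustration
function layerNorm(x, eps = 1e-5) {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((a, b) => a + (b - mean) ** 2, 0) / x.length;
  return x.map(v => (v - mean) / Math.sqrt(variance + eps));
}

const addVec = (a, b) => a.map((v, i) => v + b[i]);

// Post-norm (original 2017 Transformer): normalise AFTER the residual add
const postNorm = (x, sublayer) => layerNorm(addVec(x, sublayer(x)));

// Pre-norm (GPT-2 and later): normalise only the sublayer's input
const preNorm = (x, sublayer) => addVec(x, sublayer(layerNorm(x)));

const toySublayer = v => v.map(a => 0.5 * a);   // hypothetical stand-in for attention or FFN
const xIn = [10, 20, 30, 40];

console.log('post-norm:', postNorm(xIn, toySublayer).map(v => v.toFixed(3)));
console.log('pre-norm: ', preNorm(xIn, toySublayer).map(v => v.toFixed(3)));
```

Post-norm rescales everything to unit variance at every block, whereas pre-norm leaves the incoming values on the residual path untouched and adds a normalised correction to them.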

The feedforward network is two linear layers with a GELU activation in between: FFN(x) = GELU(x·W₁ + b₁)·W₂ + b₂. The inner dimension is typically 4× the model dimension — 3,072 for GPT-2 small. This is where most of the model's parameters live.

transformer_block.js — one complete Transformer block JS

// One complete Transformer block.
// Implements pre-norm variant (used by GPT-2+):
//   y = x + Attention(LayerNorm(x))
//   z = y + FFN(LayerNorm(y))

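function transformerBlock(X, params) {
  // Wo (the output projection from §06) is optional in this toy: when omitted,
  // the concatenated heads are used directly, which needs nHeads * d_v === d_model.
  const { heads, Wo, W1, b1, W2, b2, gamma1, beta1, gamma2, beta2 } = params;

  // ── Sublayer 1: Multi-head self-attention ─────────────────────────
  const normed1 = LA.layernorm(X, gamma1, beta1);

  // Compute Q, K, V for all heads, run causal attention, concatenate
  const attnOutputs = heads.map(h => {
    const Q = LA.matmul(normed1, h.Wq);
    const K = LA.matmul(normed1, h.Wk);
    const V = LA.matmul(normed1, h.Wv);
    const dk = K[0].length;
    const scores  = LA.scale(LA.matmul(Q, LA.T(K)), 1/Math.sqrt(dk));
    // Causal mask (GPT-style): token i must not attend to tokens j > i
    for (let i = 0; i < scores.length; i++)
      for (let j = i + 1; j < scores.length; j++) scores[i][j] = -Infinity;
    const weights = LA.softmax(scores);
    return LA.matmul(weights, V);
  });
  // Concatenate head outputs along feature dimension
  const concat = attnOutputs[0].map((_, i) =>
    attnOutputs.flatMap(h => h[i])
  );
  const attnOut = Wo ? LA.matmul(concat, Wo) : concat;
  // Residual connection
  const afterAttn = LA.add(X, attnOut);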

  // ── Sublayer 2: Feedforward network ──────────────────────────────
  const normed2 = LA.layernorm(afterAttn, gamma2, beta2);

  // FFN: x → GELU(x·W1 + b1) → ·W2 + b2
  const hidden = LA.gelu(LA.add(LA.matmul(normed2, W1), b1));
  const ffnOut = LA.add(LA.matmul(hidden, W2), b2);

  // Residual connection
  return LA.add(afterAttn, ffnOut);
}

// ── Minimal example: 3-token sequence, d_model=4, 2 heads, d_ff=8 ─
const seq   = [
  [ 0.9, 0.1, -0.2,  0.0],
  [ 0.2, 0.8,  0.7, -0.1],
  [-0.3, 0.1,  0.2,  0.9],
];
const d = 4, dff = 8;
const ones  = new Array(d).fill(1);
const zeros = new Array(d).fill(0);
const onesFF  = new Array(dff).fill(1);
const zerosFF = new Array(dff).fill(0);

const params = {
  heads: [
    { Wq: LA.randn(d,2), Wk: LA.randn(d,2), Wv: LA.randn(d,2) },
    { Wq: LA.randn(d,2), Wk: LA.randn(d,2), Wv: LA.randn(d,2) },
  ],
  W1:    LA.randn(d, dff),    b1: zerosFF,
  W2:    LA.randn(dff, d),    b2: zeros,
  gamma1: ones,  beta1: zeros,
  gamma2: ones,  beta2: zeros,
};

const output = transformerBlock(seq, params);
const tokens = ['the', 'cat', 'sat'];
console.log('Transformer block output (residual representations):');
output.forEach((row, i) => {
  console.log(`  ${tokens[i].padEnd(5)} [${row.map(v=>v.toFixed(4)).join(', ')}]`);
});
console.log('\nShape preserved: input', seq.length,'×',seq[0].length,
  '→ output', output.length,'×',output[0].length);

Shape invariant: a Transformer block takes in [seq_len × d_model] and returns [seq_len × d_model]. The shape never changes. This is what makes stacking layers trivial: the output of one block has exactly the shape the next block expects as input.

§09

Stacking layers

A Transformer model is simply N copies of the same block structure stacked vertically. GPT-2 small uses 12 layers; GPT-3 uses 96. Each layer refines the representations produced by the one below.

There is a useful (if informal) interpretation of what different layers learn: early layers tend to capture syntactic features such as part of speech and local phrase structure. Middle layers capture semantic relationships. Later layers capture task-relevant abstractions. This stratification has been observed across several models in probing-classifier studies, though the picture is more mixed than this clean story suggests.

Model	Layers	d_model	Heads	Parameters
GPT-2 small	12	768	12	117M
GPT-2 XL	48	1600	25	1.5B
GPT-3	96	12,288	96	175B
LLaMA 3 8B	32	4,096	32	8B
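
The parameter column can be roughly reproduced from the other columns. The sketch below uses a common rule of thumb, about 4·d² weights for attention plus 8·d² for a feedforward layer with a 4× inner dimension, and ignores biases, layer norms, and positional embeddings. It is an estimate under those assumptions, not an exact count, and it does not fit models that change the block design, as LLaMA 3 does.

```javascript
// Rule-of-thumb weight count for a GPT-style model.
// Per layer: 4·d² (Wq, Wk, Wv, Wo) + 8·d² (two FFN matrices with inner size 4d).
function estimateParams(nLayers, dModel, vocabSize) {
  const perLayer = 4 * dModel ** 2 + 8 * dModel ** 2;
  return nLayers * perLayer + vocabSize * dModel;   // tied embedding counted once
}

const GPT_VOCAB = 50257;   // BPE vocabulary size used by GPT-2 and GPT-3

console.log((estimateParams(48, 1600, GPT_VOCAB) / 1e9).toFixed(2) + 'B');   // 1.55B  (GPT-2 XL)
console.log((estimateParams(96, 12288, GPT_VOCAB) / 1e9).toFixed(1) + 'B');  // 174.6B (GPT-3)
```

In both cases the feedforward weights account for two thirds of each layer, and the embedding table is a small fraction of the total once the model is large.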

stack.js — stacking N transformer blocks JS

// Stacking Transformer blocks. The shape [seq_len × d_model] flows
// unchanged through all N layers.

function makeBlockParams(dModel, nHeads, dFF) {
  const dK = dModel / nHeads;
  const ones  = new Array(dModel).fill(1);
  const zeros = new Array(dModel).fill(0);
  return {
    heads: Array.from({length: nHeads}, () => ({
      Wq: LA.randn(dModel, dK),
      Wk: LA.randn(dModel, dK),
      Wv: LA.randn(dModel, dK),
    })),
    W1:    LA.randn(dModel, dFF),
    b1:    new Array(dFF).fill(0),
    W2:    LA.randn(dFF, dModel),
    b2:    new Array(dModel).fill(0),
    gamma1: ones, beta1: zeros,
    gamma2: ones, beta2: zeros,
  };
}

function transformerStack(X, numLayers, dModel, nHeads, dFF) {
  let h = X;
  for (let layer = 0; layer < numLayers; layer++) {
    const params = makeBlockParams(dModel, nHeads, dFF);
    h = transformerBlock(h, params);
    // In a real model, params are fixed after training —
    // here we reinitialize each time to keep the example self-contained
  }
  return h;
}

// Tiny architecture: 3 layers, d_model=4, 2 heads, d_ff=8
const input = [
  [0.9, 0.1, -0.2, 0.0],  // the
  [0.2, 0.8,  0.7,-0.1],  // cat
  [-0.3,0.1,  0.2, 0.9],  // sat
];

console.log('Input shape:', input.length, '×', input[0].length);
const out3 = transformerStack(input, 3, 4, 2, 8);
console.log('After 3 layers shape:', out3.length, '×', out3[0].length);
console.log('\nFinal hidden states:');
['the','cat','sat'].forEach((tok, i) => {
  console.log(`  ${tok.padEnd(5)} [${out3[i].map(v=>v.toFixed(4)).join(', ')}]`);
});
console.log('\nNote: shape is identical to input — blocks are stackable indefinitely');

§10

The full forward pass

We now have every piece. A complete GPT-style forward pass is:

1. Tokenize input text → sequence of integer IDs
2. Embed each token ID → vector via embedding table
3. Add positional encoding to each embedding vector
4. Pass the resulting matrix through N Transformer blocks
5. Apply a final layer normalization
6. Multiply by the unembedding matrix → logits (raw scores, one per vocabulary token, before any normalisation)
7. Apply softmax → probability distribution over next token

Step 6 is often called the language model head. The unembedding matrix is typically the transpose of the embedding table — a weight-tying trick that reduces parameters and works well in practice.
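
With tied weights, the logit for each vocabulary token is simply the dot product between the final hidden state and that token's embedding. A minimal sketch for a single position, using a four-word table and a hand-picked hidden state (both are illustrative assumptions, not values from a trained model):

```javascript
// Hypothetical 4-token embedding table, reused as the unembedding matrix
const tiedE = [
  [ 0.9, 0.1, -0.2,  0.0],  // the
  [ 0.2, 0.8,  0.7, -0.1],  // cat
  [ 0.2, 0.7,  0.8,  0.1],  // dog
  [-0.3, 0.1,  0.2,  0.9],  // sat
];
const tiedVocab = ['the', 'cat', 'dog', 'sat'];

// Language model head for one position: logits = h · Eᵀ, then softmax
function lmHead(h, E) {
  const logits = E.map(row => row.reduce((s, v, i) => s + v * h[i], 0));
  const max = Math.max(...logits);
  const ex = logits.map(x => Math.exp(x - max));
  const total = ex.reduce((a, b) => a + b, 0);
  return ex.map(x => x / total);
}

const hiddenState = [0.2, 0.8, 0.7, -0.1];   // made-up final hidden state
const headProbs = lmHead(hiddenState, tiedE);
headProbs.forEach((p, i) => console.log(`  ${tiedVocab[i].padEnd(4)} ${p.toFixed(3)}`));
```

Tokens whose embeddings align with the hidden state score highest. Here that is "cat", with "dog" close behind because the two embeddings are nearly parallel.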

gpt_forward.js — complete forward pass JS

// Complete GPT-style forward pass.
// Architecture: tiny toy model — vocab=8, d_model=4, 2 layers, 2 heads.

// ── Reuse LA, transformerBlock and makeBlockParams from previous sections ──

// ── Tiny vocabulary ───────────────────────────────────────────────
const VOCAB  = ['the','cat','dog','sat','ran','on','mat','.'];
const V_SIZE = VOCAB.length;  // 8
const D      = 4;             // d_model
const N_HEADS= 2;
const N_LAYERS=2;
const D_FF   = 8;

// ── Embedding table: V_SIZE × D (normally learned; here fixed) ───
const embTable = [
  [ 0.9, 0.1,-0.2, 0.0],  // the
  [ 0.2, 0.8, 0.7,-0.1],  // cat
  [ 0.2, 0.7, 0.8, 0.1],  // dog
  [-0.3, 0.1, 0.2, 0.9],  // sat
  [-0.4, 0.1, 0.1, 0.8],  // ran
  [ 0.8, 0.0,-0.3, 0.1],  // on
  [ 0.3,-0.1,-0.2, 0.7],  // mat
  [ 0.1, 0.1, 0.1, 0.1],  // .
];

function embed(tokenIds) {
  return tokenIds.map(id => [...embTable[id]]);
}

function sinPE(seqLen, d) {
  return Array.from({length:seqLen},(_,pos)=>
    Array.from({length:d},(_,i)=>{
      const dim = Math.floor(i/2);
      const angle = pos / Math.pow(10000, 2*dim/d);
      return i%2===0 ? Math.sin(angle) : Math.cos(angle);
    })
  );
}

function gptForward(tokens) {
  // 1. Embed
  const ids = tokens.map(t => VOCAB.indexOf(t));
  const embs = embed(ids);

  // 2. Add positional encodings
  const pe   = sinPE(tokens.length, D);
  let   h    = embs.map((row, i) => row.map((v,j) => v + pe[i][j]));

  // 3. N transformer layers
  for (let l = 0; l < N_LAYERS; l++) {
    h = transformerBlock(h, makeBlockParams(D, N_HEADS, D_FF));
  }

  // 4. Final layer norm
  const ones  = new Array(D).fill(1);
  const zeros = new Array(D).fill(0);
  h = LA.layernorm(h, ones, zeros);

  // 5. Language model head: h × Eᵀ → logits [seq_len × vocab_size]
  const logits = LA.matmul(h, LA.T(embTable));

  // 6. Softmax → probabilities for next token at each position
  const probs = LA.softmax(logits);

  return { logits, probs };
}

// ── Run forward pass on "the cat sat on the" ─────────────────────
const prompt = ['the', 'cat', 'sat', 'on', 'the'];
const { probs } = gptForward(prompt);

// Look at predictions after the last token ("the")
const lastProbs = probs[probs.length - 1];
const ranked = VOCAB.map((w,i) => [w, lastProbs[i]])
  .sort((a,b) => b[1]-a[1]);

console.log(`Prompt: "${prompt.join(' ')}"`);
console.log(`\nProbability distribution over next token:`);
ranked.forEach(([w, p]) => {
  const bar = '█'.repeat(Math.round(p * 40));
  console.log(`  ${w.padEnd(5)} ${(p*100).toFixed(1).padStart(5)}%  ${bar}`);
});
console.log(`\n(Weights are random — a trained model would give meaningful probs)`);

What's missing for a real model? This forward pass is structurally correct but uses random weights. A real GPT-2 follows the same overall design, with much larger matrices and a few details this toy skips (learned positional embeddings, for instance), and with 117 million carefully trained numbers instead of random ones. Part 2 of this series covers how those numbers get trained.

You have now seen the complete forward pass of a language model, built from 30 lines of linear algebra. The chain is: text → tokens → embeddings → positional encoding → N × (attention + feedforward) → logits → probabilities.

Today's large language models, GPT-4, Claude, and Gemini among them, are understood to build on this same structure, scaled up. More layers, more heads, larger embedding dimensions, trained on more data with better techniques. But the fundamental computation is what you have just read and run.

Next in this series

→ Part 2: Training & Inference — how the random weights above become meaningful ones. Cross-entropy loss, backpropagation, the training loop, sampling strategies, and how RLHF turns a next-token predictor into a conversational assistant.
