A hands-on guide for programmers. Every concept comes with runnable JavaScript you can read, modify, and execute right here.
GPT does exactly one thing: given a list of tokens, predict the most likely next token. That's it. Repeat that loop, appending each predicted token to the input, and you get text generation.
The code below shows the generation loop in pure JavaScript — no libraries needed. Run it to see how a GPT-style model would repeatedly call itself to produce a sequence.
// GPT generation loop — the core idea
// In a real model, predict() is a giant neural network.
// Here we stub it with a tiny lookup table.
function predict(tokens) {
  // Stub: returns next token based on last two tokens
  const bigrams = {
    "the cat": "sat",
    "cat sat": "on",
    "sat on": "the",
    "on the": "mat",
    "the mat": ".",
    "mat .": "[END]",
  };
  const key = tokens.slice(-2).join(" ");
  return bigrams[key] ?? "[END]";
}

function generate(prompt, maxTokens = 10) {
  let tokens = prompt.split(" ");
  console.log("Prompt:", tokens.join(" "));
  for (let i = 0; i < maxTokens; i++) {
    const next = predict(tokens);
    if (next === "[END]") break;
    tokens.push(next);
    console.log(`  step ${i + 1}: predicted "${next}" → "${tokens.join(" ")}"`);
  }
  return tokens.join(" ");
}

generate("the cat");
Each call to predict() is one forward pass through the network. The real GPT-2 does the same thing, except predict() is a neural network with roughly 124 million parameters (in the small model) instead of a lookup table.
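That 124M figure can be sanity-checked from the published GPT-2 small configuration (vocab 50,257; embedding dim 768; 12 layers; context 1,024; MLP expansion 4×). A back-of-the-envelope tally in JavaScript:

```javascript
// Rough parameter count for GPT-2 small from its published config.
const V = 50257, D = 768, L = 12, T = 1024;
const embeddings = V * D + T * D;                 // token + position tables
const perBlock =
  (D * 3 * D + 3 * D) +                           // QKV projection (+ bias)
  (D * D + D) +                                   // attention output projection
  (D * 4 * D + 4 * D) + (4 * D * D + D) +         // MLP up/down (+ biases)
  2 * (2 * D);                                    // two LayerNorms (gain + bias)
const total = embeddings + L * perBlock + 2 * D;  // + final LayerNorm
console.log((total / 1e6).toFixed(1) + "M parameters"); // ≈ 124.4M
```

Nearly a third of the total sits in the token embedding table alone.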
Neural networks work with numbers, not strings. A tokenizer splits text into chunks called tokens and maps each to an integer ID. GPT-2 uses a vocabulary of 50,257 tokens.
The algorithm is Byte Pair Encoding (BPE): start with individual bytes, then iteratively merge the most common adjacent pairs. Common words get a single token; rare words get split into subwords.
// Simplified BPE tokenizer (character-level for clarity)
// Real GPT uses 50k subword merges — same idea, bigger vocab
function bpeTokenize(text, merges) {
  // Start: every character is its own token
  let tokens = [...text].map(c => [c]);
  for (const [a, b] of merges) {
    const merged = [];
    let i = 0;
    while (i < tokens.length) {
      if (i < tokens.length - 1 && tokens[i].join("") === a && tokens[i+1].join("") === b) {
        merged.push([...tokens[i], ...tokens[i+1]]); // merge!
        i += 2;
      } else {
        merged.push(tokens[i]);
        i += 1;
      }
    }
    tokens = merged;
  }
  return tokens.map(t => t.join(""));
}

// Learned merge rules (simplified — real BPE has thousands)
const merges = [
  ["h", "e"], ["he", "l"], ["hel", "l"], ["hell", "o"],
  ["w", "o"], ["wo", "r"], ["wor", "l"], ["worl", "d"]
];

const result = bpeTokenize("hello world", merges);
console.log("Tokens:", result);
console.log("Count:", result.length, "(vs", "hello world".length, "chars)");
Token IDs are just integers — they have no geometry. An embedding table maps each ID to a dense vector of floats (768 numbers in GPT-2 small). This vector is the token's "meaning" according to the model.
GPT learns two embedding tables: one for token identity and one for position. The demo below isolates the token lookup first. In the real model, a same-sized position embedding is then added elementwise, so the final input vector carries both what the token is and where it appears.
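The elementwise addition of the two tables can be sketched in a few lines. The vectors below are made-up 4-dimensional toy values, not from any trained model:

```javascript
// Sketch: combining token and position embeddings (toy, hand-picked values).
const tokEmb = { the: [0.1, 0.2, 0.3, 0.4], cat: [0.5, 0.1, -0.2, 0.3] };
const posEmb = [ [0.01, 0.0, -0.01, 0.02], [0.02, -0.01, 0.0, 0.01] ];

// Input vector for position i = token embedding + position embedding, elementwise
function embed(tokens) {
  return tokens.map((t, i) => tokEmb[t].map((v, d) => v + posEmb[i][d]));
}

console.log(embed(["the", "cat"]));
```

Note that the same token produces a different input vector at a different position, which is how word order enters the model.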
The vectors shown in the diagram and in the code snippet below are hand-picked toy data chosen to make the geometry visible. They are not computed from text and they are not copied from a trained GPT checkpoint.
x_i = token_embedding(token_i) + position_embedding(i)

// Embedding lookup — the simplest operation in the model
// In PyTorch: nn.Embedding(vocab_size, n_embd)
// Here: a 2D array we index by token ID
function createEmbeddingTable(vocabSize, nEmbd) {
  // In training, weights start random and are learned.
  // Here we seed with Math.random for illustration.
  return Array.from({length: vocabSize}, () =>
    Array.from({length: nEmbd}, () => (Math.random() - 0.5) * 0.02)
  );
}

function dotProduct(a, b) {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

function cosineSimilarity(a, b) {
  const dot = dotProduct(a, b);
  const normA = Math.sqrt(dotProduct(a, a));
  const normB = Math.sqrt(dotProduct(b, b));
  return dot / (normA * normB);
}

// Tiny vocab: 5 tokens, 4-dimensional toy embeddings
// These rows are hand-chosen for illustration, not learned from training.
const vocab = {"cat": 0, "dog": 1, "king": 2, "queen": 3, "the": 4};
const table = [
  [ 0.8, 0.2, -0.5, 0.1],  // cat
  [ 0.7, 0.3, -0.4, 0.2],  // dog ← similar to cat
  [ 0.1, 0.9, 0.6, -0.3],  // king
  [ 0.2, 0.8, 0.5, -0.2],  // queen ← similar to king
  [-0.1, 0.0, 0.0, 0.0],   // the ← different
];

// Compute pairwise cosine similarities
const words = Object.keys(vocab);
for (const w1 of words) {
  for (const w2 of words) {
    if (w1 >= w2) continue;
    const sim = cosineSimilarity(table[vocab[w1]], table[vocab[w2]]);
    console.log(`sim("${w1}", "${w2}") = ${sim.toFixed(3)}`);
  }
}
Notice cat/dog and king/queen will show high cosine similarity — their vectors point in similar directions. This geometric structure is what lets GPT generalize: tokens with similar meanings cluster together in embedding space.
In a real GPT, an embedding vector is just one learned row of the model's embedding matrix. Training starts with small random values, runs the forward pass to predict the next token, measures the error, then backpropagation sends gradients back through the network to every parameter that contributed to that prediction, including the embedding rows that were looked up. The optimizer nudges those rows a little on each step. After many updates, tokens that play similar roles in prediction tend to end up with similar vectors, which is where the geometry comes from in practice.
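A single one of those nudges looks like a plain SGD step on one embedding row. The gradient values below are fabricated for illustration; in training they would come from backpropagation:

```javascript
// Sketch: one optimizer nudge to a single embedding row (plain SGD).
let row = [0.01, -0.02, 0.03, 0.00];       // embedding row for some token
const grad = [0.5, -0.1, 0.0, 0.2];        // dLoss/dRow (made-up values)
const lr = 0.01;

row = row.map((w, i) => w - lr * grad[i]); // w ← w − lr · g
console.log(row);
```

Dimensions with zero gradient stay put; the rest move a little against the direction that increased the loss.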
Attention is the mechanism by which tokens "look at" each other. For every token, the model computes how much weight to give every preceding token when building its representation.
Each token produces three vectors from its embedding: Query (what am I looking for?), Key (what do I offer?), Value (what do I contribute if selected?). The attention score between two tokens is their query·key dot product.
// Scaled dot-product attention — the core operation
// attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
function softmax(arr) {
  const max = Math.max(...arr); // subtract max for numerical stability
  const exps = arr.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

function dot(a, b) {
  return a.reduce((s, v, i) => s + v * b[i], 0);
}

function attention(Q, K, V) {
  const seqLen = Q.length;
  const dK = Q[0].length;
  const scale = Math.sqrt(dK);
  const output = [];
  for (let i = 0; i < seqLen; i++) {
    // Compute raw scores: how much does token i attend to each token j≤i?
    const scores = [];
    for (let j = 0; j <= i; j++) { // causal: only past tokens
      scores.push(dot(Q[i], K[j]) / scale);
    }
    const weights = softmax(scores); // normalize to probabilities
    // Weighted sum of values
    const out = new Array(V[0].length).fill(0);
    weights.forEach((w, j) => {
      V[j].forEach((v, k) => out[k] += w * v);
    });
    output.push(out);
    console.log(`Token ${i}: weights = [${weights.map(w => w.toFixed(3)).join(", ")}]`);
  }
  return output;
}

// Toy example: 3 tokens, 2-dim Q/K/V
const Q = [[1.0, 0.5], [0.2, 0.8], [0.9, 0.1]];
const K = [[0.8, 0.3], [0.5, 0.9], [0.7, 0.4]];
const V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]];

console.log("--- Attention weights per token ---");
attention(Q, K, V);
Multi-head attention runs this same computation H times in parallel, each with different learned Q/K/V projection matrices. Each head learns to attend to different relationships — syntax, coreference, semantics. Their outputs are concatenated and projected back.
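The split-and-concatenate bookkeeping is easy to show on its own. In this sketch each head just applies a stub transform instead of running full attention, so only the reshaping is real:

```javascript
// Sketch of multi-head bookkeeping: split a vector into H heads,
// process each head independently, then concatenate.
function splitHeads(x, nHead) {
  const dHead = x.length / nHead;
  return Array.from({ length: nHead }, (_, h) => x.slice(h * dHead, (h + 1) * dHead));
}

function concatHeads(heads) {
  return heads.flat();
}

const x = [1, 2, 3, 4, 5, 6];
const heads = splitHeads(x, 2);                      // [[1,2,3], [4,5,6]]
const processed = heads.map(h => h.map(v => v * 2)); // stub for per-head attention
console.log(concatHeads(processed));                 // [2, 4, 6, 8, 10, 12]
```

Because each head sees only its own slice, the heads are free to specialize; the final projection then mixes their outputs back together.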
Attention alone isn't enough. Each attention output is fed into a small feedforward network (MLP) applied independently per token. This is where the model stores most of its "factual" knowledge. The two operations are wrapped in a block with residual connections and layer normalization.
A feedforward network here just means a tiny neural network applied to one token vector at a time. Attention mixes information between tokens; the feedforward step then takes the mixed vector for a single token, pushes it through a couple of learned matrix multiplies plus a nonlinearity, and rewrites that token into a richer representation. "Feedforward" simply means the data moves straight through those layers in one direction during the forward pass, unlike attention which explicitly looks sideways at other tokens.
// A full pre-layer-norm transformer block in JavaScript
// x → LayerNorm → Attention → residual → LayerNorm → MLP → residual
function layerNorm(x) {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / x.length;
  const std = Math.sqrt(variance + 1e-5); // epsilon for numerical stability
  return x.map(v => (v - mean) / std);    // zero mean, unit variance
}

function gelu(x) {
  // Gaussian Error Linear Unit — smoother than ReLU
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}

function linearLayer(x, W, b) {
  // y = xW + b (simplest neural network layer)
  return W.map((row, i) => row.reduce((s, w, j) => s + w * x[j], 0) + b[i]);
}

function mlp(x, W1, b1, W2, b2) {
  const hidden = linearLayer(x, W1, b1).map(gelu); // expand (4× in real GPT-2; 2× in this toy)
  return linearLayer(hidden, W2, b2);              // project back
}

function blockForwardPass(x, attnOut, W1, b1, W2, b2) {
  // GPT-2 style pre-LN block: normalize before each sub-layer, then add residual.
  // Note: attnOut is passed in as a stub; in a real block it would be computed
  // as Attention(ln1), so ln1 is shown here only to illustrate the normalization.
  const ln1 = layerNorm(x);
  const afterAttn = x.map((v, i) => v + attnOut[i]);
  const ln2 = layerNorm(afterAttn);
  const mlpOut = mlp(ln2, W1, b1, W2, b2);
  const afterMlp = afterAttn.map((v, i) => v + mlpOut[i]);
  console.log("Input x:", x.map(v => v.toFixed(3)));
  console.log("LN1(x):", ln1.map(v => v.toFixed(3)));
  console.log("After attention + residual:", afterAttn.map(v => v.toFixed(3)));
  console.log("LN2(afterAttn):", ln2.map(v => v.toFixed(3)));
  console.log("MLP output:", mlpOut.map(v => v.toFixed(3)));
  console.log("After MLP + residual:", afterMlp.map(v => v.toFixed(3)));
  return afterMlp;
}

// Demo: 4-dimensional token representation
const x = [0.5, -1.2, 0.8, 0.3];
const attnOut = [0.1, 0.4, -0.2, 0.6];
const W1 = [
  [0.5, -0.1, 0.3, 0.2],
  [-0.3, 0.6, 0.1, 0.0],
  [0.2, 0.2, -0.4, 0.5],
  [0.1, -0.5, 0.2, 0.3],
  [0.4, 0.0, 0.3, -0.2],
  [-0.2, 0.1, 0.5, 0.4],
  [0.3, -0.3, 0.2, 0.1],
  [0.0, 0.4, -0.2, 0.6],
];
const b1 = [0.1, -0.2, 0.0, 0.05, 0.1, -0.1, 0.0, 0.2];
const W2 = [
  [0.2, -0.1, 0.3, 0.0, 0.2, -0.2, 0.1, 0.4],
  [-0.3, 0.4, 0.1, 0.2, 0.0, 0.3, -0.2, 0.1],
  [0.1, 0.2, -0.4, 0.3, 0.5, -0.1, 0.2, 0.0],
  [0.4, 0.0, 0.2, -0.3, 0.1, 0.2, 0.3, -0.2],
];
const b2 = [0.05, -0.1, 0.2, 0.0];

blockForwardPass(x, attnOut, W1, b1, W2, b2);

// Show GELU vs ReLU comparison
console.log("\n--- Activation function comparison ---");
[-2, -1, 0, 1, 2].forEach(v =>
  console.log(`x=${v}: ReLU=${Math.max(0,v).toFixed(3)}, GELU=${gelu(v).toFixed(3)}`)
);
After the final transformer block, a linear projection produces one score (logit) per vocabulary token. Softmax turns these into probabilities. Then we sample — we don't just pick the highest probability token every time, because that produces repetitive text.
Two parameters control the sampling distribution: temperature (divide logits before softmax — lower = sharper) and top-k (only sample from the k most likely tokens).
// Softmax + temperature + top-k sampling
function softmax(logits, temperature = 1.0) {
  const scaled = logits.map(l => l / temperature);
  const max = Math.max(...scaled);
  const exps = scaled.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

function topKFilter(logits, k) {
  const sorted = [...logits].sort((a, b) => b - a);
  const threshold = sorted[k - 1];
  return logits.map(l => l >= threshold ? l : -Infinity);
}

function sample(probs) {
  let r = Math.random(), cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r <= cumulative) return i;
  }
  return probs.length - 1;
}

// Raw logits from the model's final linear layer
const vocab = ["cat", "dog", "fish", "bird", "horse"];
const logits = [3.5, 2.8, 0.5, -0.2, 1.1];

// Try different temperatures
[0.5, 1.0, 2.0].forEach(temp => {
  const probs = softmax(logits, temp);
  console.log(`\ntemperature = ${temp}:`);
  vocab.forEach((w, i) => console.log(`  ${w}: ${(probs[i]*100).toFixed(1)}%`));
});

// Top-k sampling: only sample from top 2 tokens
const filtered = topKFilter(logits, 2);
const probs = softmax(filtered, 1.0);
console.log("\ntop-k=2 sampling:");
vocab.forEach((w, i) => console.log(`  ${w}: ${(probs[i]*100).toFixed(1)}%`));

// Sample 1000 times to show the distribution
const counts = new Array(vocab.length).fill(0);
for (let i = 0; i < 1000; i++) counts[sample(probs)]++;
console.log("\nSampled 1000 times (top-k=2, temp=1):");
vocab.forEach((w, i) => console.log(`  ${w}: ${counts[i]} times`));
Training is supervised self-prediction: given tokens 1…N, the model predicts tokens 2…N+1; the target at each position is simply the next token in the training corpus. No human labeling is needed: the text itself is the label.
The cross-entropy loss measures how surprised the model is: loss = −log(p(correct_token)). If the model assigns 1% probability to the right token, loss ≈ 4.6. If it assigns 99%, loss ≈ 0.01.
The model ingests context ["the", "cat", "sat"] and outputs a probability over all vocabulary tokens for what comes next.
The true next token is "on". Cross-entropy loss penalizes the model for assigning low probability to it.
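The loss values quoted earlier can be checked directly from the formula loss = −log(p):

```javascript
// Cross-entropy loss for a single correct token, as a function of
// the probability the model assigned to it.
const loss = p => -Math.log(p);

console.log(loss(0.01).toFixed(3)); // 1% probability  → loss ≈ 4.605
console.log(loss(0.99).toFixed(3)); // 99% probability → loss ≈ 0.010
```

The logarithm makes confident wrong answers much more expensive than mildly uncertain ones.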
Backpropagation computes the gradient of the loss with respect to every parameter using the chain rule — flowing backward through the computation graph.
The Adam optimizer applies gradients using adaptive estimates of their first and second moments, rather than the plain update w = w − lr × gradient. After millions of steps, the weights encode the statistical structure of language.
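For a single scalar parameter, the standard Adam update rule (with the usual defaults β1=0.9, β2=0.999, ε=1e-8) can be sketched as follows; the gradient values here are illustrative, not from a real model:

```javascript
// One Adam update for one parameter, following the standard update rule.
function adamStep(w, g, state, lr = 0.001, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  state.t += 1;
  state.m = b1 * state.m + (1 - b1) * g;      // first moment: running mean of grads
  state.v = b2 * state.v + (1 - b2) * g * g;  // second moment: running mean of grad²
  const mHat = state.m / (1 - b1 ** state.t); // bias correction for zero init
  const vHat = state.v / (1 - b2 ** state.t);
  return w - lr * mHat / (Math.sqrt(vHat) + eps);
}

let w = 0.5;
const state = { m: 0, v: 0, t: 0 };
for (const g of [0.4, 0.38, 0.41]) w = adamStep(w, g, state); // made-up gradients
console.log(w.toFixed(4));
```

Because the step is normalized by √vHat, consistently-sized gradients produce steps of roughly the learning rate regardless of the gradients' absolute scale.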
// Cross-entropy loss and a minimal gradient descent step
// No autograd library needed — we'll do it by hand
function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

function crossEntropyLoss(probs, targetIndex) {
  return -Math.log(probs[targetIndex] + 1e-10); // avoid log(0)
}

// Gradient of cross-entropy + softmax with respect to logits:
// dL/dlogit[i] = probs[i] - (i === target ? 1 : 0)
// This elegant result comes from differentiating through both.
function gradients(probs, targetIndex) {
  return probs.map((p, i) => p - (i === targetIndex ? 1 : 0));
}

// Run gradient descent (steps 0–20), optimizing the logits directly
const vocab = ["on", "down", "up", "away"];
const targetIdx = 0; // "on" is the correct next word
let logits = [0.1, 0.2, 0.15, 0.05]; // random initialization
const lr = 0.5;

console.log("Training to predict 'on' after 'the cat sat'...");
for (let step = 0; step <= 20; step++) {
  const probs = softmax(logits);
  const loss = crossEntropyLoss(probs, targetIdx);
  if (step % 5 === 0) {
    console.log(`Step ${step.toString().padStart(2)}: loss=${loss.toFixed(4)}, p("on")=${(probs[0]*100).toFixed(1)}%`);
  }
  const grads = gradients(probs, targetIdx);
  logits = logits.map((l, i) => l - lr * grads[i]); // gradient descent
}

const finalProbs = softmax(logits);
console.log("\nFinal probabilities:");
vocab.forEach((w, i) => console.log(`  "${w}": ${(finalProbs[i]*100).toFixed(1)}%`));
Watch how the loss drops and p("on") climbs toward 100% over just 20 steps. Real GPT training runs this over billions of tokens, with Adam instead of vanilla gradient descent, learning rate schedules, and gradient clipping — but the core loop is identical.
A full GPT forward pass: token IDs → embeddings → N × transformer blocks → final layer norm → linear projection → logits → softmax → sample. The code below wires up a tiny GPT-2-style decoder in JavaScript. It follows the same pre-layer-norm block structure used in OpenAI's GPT-2 code and Karpathy's nanoGPT, just at toy scale.
// Tiny GPT-2-style decoder in plain JavaScript
// Reference structure: OpenAI GPT-2 and Karpathy's nanoGPT
// Vocab=8, n_embd=4, n_head=2, n_layer=2, block_size=8
const CFG = { vocabSize: 8, nEmbd: 4, nHead: 2, nLayer: 2, blockSize: 8 };
function randn() { return (Math.random() + Math.random() + Math.random() - 1.5) * 0.5; }
function mat(r, c) { return Array.from({ length: r }, () => Array.from({ length: c }, randn)); }
function vec(n) { return Array.from({ length: n }, randn); }
function zeros(n) { return new Array(n).fill(0); }
function addvec(a, b) { return a.map((v, i) => v + b[i]); }
function dot(a, b) { return a.reduce((s, v, i) => s + v * b[i], 0); }
function matvec(M, x) { return M.map(row => row.reduce((s, w, i) => s + w * x[i], 0)); }
function softmax(xs) {
const max = Math.max(...xs);
const exps = xs.map(x => Math.exp(x - max));
const sum = exps.reduce((a, b) => a + b, 0);
return exps.map(x => x / sum);
}
function layerNorm(x) {
const mean = x.reduce((a, b) => a + b, 0) / x.length;
const variance = x.reduce((s, xi) => s + (xi - mean) ** 2, 0) / x.length;
return x.map(xi => (xi - mean) / Math.sqrt(variance + 1e-5));
}
function gelu(x) {
return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}
const weights = {
wte: mat(CFG.vocabSize, CFG.nEmbd),
wpe: mat(CFG.blockSize, CFG.nEmbd),
blocks: Array.from({ length: CFG.nLayer }, () => ({
Wqkv: mat(CFG.nEmbd * 3, CFG.nEmbd),
Wout: mat(CFG.nEmbd, CFG.nEmbd),
W1: mat(CFG.nEmbd * 4, CFG.nEmbd),
b1: vec(CFG.nEmbd * 4),
W2: mat(CFG.nEmbd, CFG.nEmbd * 4),
b2: vec(CFG.nEmbd),
})),
};
// Weight tying: GPT-2 ties token embeddings and output projection weights.
const lmHead = weights.wte;
function selfAttention(xs, W) {
const dHead = CFG.nEmbd / CFG.nHead;
const qkvs = xs.map(x => matvec(W.Wqkv, x));
return xs.map((_, i) => {
const joinedHeads = zeros(CFG.nEmbd);
for (let h = 0; h < CFG.nHead; h++) {
const s = h * dHead;
const e = s + dHead;
const q = qkvs[i].slice(s, e);
const scores = [];
for (let j = 0; j <= i; j++) {
const k = qkvs[j].slice(CFG.nEmbd + s, CFG.nEmbd + e);
scores.push(dot(q, k) / Math.sqrt(dHead));
}
const attn = softmax(scores);
const headOut = zeros(dHead);
attn.forEach((w, j) => {
const v = qkvs[j].slice(CFG.nEmbd * 2 + s, CFG.nEmbd * 2 + e);
v.forEach((value, k) => { headOut[k] += w * value; });
});
headOut.forEach((value, k) => { joinedHeads[s + k] = value; });
}
return matvec(W.Wout, joinedHeads);
});
}
function mlp(x, W) {
const hidden = matvec(W.W1, x).map((v, i) => gelu(v + W.b1[i]));
return addvec(matvec(W.W2, hidden), W.b2);
}
function transformerBlock(xs, W) {
// GPT-2 / nanoGPT block order: LN -> attention -> residual, then LN -> MLP -> residual.
const attnIn = xs.map(layerNorm);
const attnOut = selfAttention(attnIn, W);
const afterAttn = xs.map((x, i) => addvec(x, attnOut[i]));
const mlpIn = afterAttn.map(layerNorm);
const mlpOut = mlpIn.map(x => mlp(x, W));
return afterAttn.map((x, i) => addvec(x, mlpOut[i]));
}
function forward(tokenIds) {
let xs = tokenIds.map((id, pos) => addvec(weights.wte[id], weights.wpe[pos]));
for (const block of weights.blocks) xs = transformerBlock(xs, block);
const last = layerNorm(xs[xs.length - 1]);
const logits = lmHead.map(row => dot(row, last));
return softmax(logits);
}
const tokenIds = [2, 5, 1];
const probs = forward(tokenIds);
console.log("Input token IDs:", tokenIds);
console.log("Output probabilities over vocab (random weights, untrained):");
probs.forEach((p, i) => console.log(` token ${i}: ${(p * 100).toFixed(2)}%`));
console.log("Predicted next token ID:", probs.indexOf(Math.max(...probs)));
console.log("");
console.log("Note: the structure matches GPT-2-style pre-LN blocks,");
console.log("but the weights are random, so the probabilities are meaningless until training.");
This toy decoder keeps the same high-level GPT-2 pattern: learned token and position embeddings, causal self-attention, pre-layer-norm residual blocks, an MLP, a final layer norm, and a projection to vocabulary logits. Production GPT-2 is larger and trained, but this version follows the block ordering of the real model rather than a post-layer-norm variant.