A self-hosting compiler is written in the very language it compiles. But to compile itself for the first time, it needs to already exist. This is not a paradox — it is a deliberate engineering act called bootstrapping, with a precise sequence of steps reaching back to a seed.
Suppose you want to write a compiler for a language called Foo, and you want to write it in Foo itself. You write the source code of foocc.foo. Now you need to compile it. But the only compiler for Foo is… the one you just wrote. Which isn't compiled yet.
This appears to be a paradox. It is not — but resolving it requires understanding that a self-hosting compiler exists at two distinct levels of abstraction at the same time: as source code (text you can read and modify) and as a running binary (the thing that does the compiling).
// The bootstrapping problem: a compiler that compiles itself // We simulate the three-stage process that resolves the apparent paradox // Stage 0: the seed compiler, written in some OTHER language (C, assembly, etc.) // It compiles a SUBSET of the target language — just enough to get started function seedCompiler(source) { // Minimal: handles only basic constructs if (source.includes("unsupported_feature")) throw new Error("Seed compiler: feature not implemented"); return { type: "binary_from_seed", source, size: source.length * 4 }; } // Stage 1: the full compiler, written in the TARGET language // This is the source code of the self-hosting compiler const fullCompilerSource = ` class Compiler { parse(source) { /* full parser */ } typeCheck(ast) { /* type checker */ } codegen(ast) { /* code generator */ } optimise(code) { /* optimiser */ } compile(source) { return this.codegen(this.typeCheck(this.parse(source))); } } `.trim(); // Bootstrap step 1: use seed compiler to compile the full compiler source const stage1Binary = seedCompiler(fullCompilerSource); console.log("Stage 0→1: seed compiler produces first binary"); console.log(` Input: full compiler source (${fullCompilerSource.length} chars)`); console.log(` Output: stage-1 binary (${stage1Binary.size} bytes, compiled by SEED)`); // Stage 1 binary: can now compile full source, but was BUILT by the seed compiler // It may have missing optimisations the seed compiler didn't implement function stage1Compile(source) { return { type: "full_compiler_binary", source, size: source.length * 3, emitted: `FULL-COMPILER:${source.length}:OPT2` }; } // Bootstrap step 2: use stage-1 binary to recompile the full compiler source AGAIN const stage2Binary = stage1Compile(fullCompilerSource); console.log("\nStage 1→2: stage-1 compiles the same source again"); console.log(` Output: stage-2 binary (${stage2Binary.size} bytes, compiled by STAGE-1)`); // Bootstrap step 3: compile again with stage-2 binary function stage2Compile(source) { return { type: "full_compiler_binary", source, size: source.length * 3, emitted: `FULL-COMPILER:${source.length}:OPT2` }; } const stage3Binary = stage2Compile(fullCompilerSource); console.log(`Stage 2→3: stage-2 compiles again (same source)`); console.log(` Output: stage-3 binary (${stage3Binary.size} bytes)`); // The key invariant: stage-2 and stage-3 should produce IDENTICAL binaries // because both compiled the same source with the same fully-working compiler // This is the "bit-for-bit reproducible" check used to verify bootstrap correctness const stable = stage2Binary.emitted === stage3Binary.emitted; console.log(`\nBootstrap complete. Stable (stage2 == stage3): ${stable}`); console.log("The seed compiler can now be discarded."); console.log("Future builds use only the self-hosted compiler to compile itself.");
Every self-hosting compiler has been through a bootstrapping sequence. The stages have a precise structure. The key insight is that compiling the same compiler source code with different compiler binaries may produce different output. Stability is achieved when stage n and stage n+1 compile that same compiler source to the same result. In a classic 3-stage bootstrap, that is the stage2 == stage3 check.
// The bootstrap fixpoint: a compiler reaches "stability" when // compile(source, compiler_n) == compile(source, compiler_n+1) // We model this as converging optimisation quality class CompilerVersion { constructor(name, optimisationLevel, features) { this.name = name; this.optLevel = optimisationLevel; // 0-100: how good is the output? this.features = features; // set of language features this can compile } compile(sourceName, requiredFeatures) { const missing = requiredFeatures.filter(f => !this.features.has(f)); if (missing.length) throw new Error(`${this.name} cannot compile ${sourceName}: missing ${missing}`); // The output binary's quality depends on the compiler's optimisation level return new CompilerVersion( `${sourceName} compiled by ${this.name}`, Math.min(100, this.optLevel + 15), // each stage improves a bit new Set(requiredFeatures) ); } } // The full compiler source requires these language features const FULL_FEATURES = ["closures", "generics", "pattern_match", "tail_calls"]; const SOURCE_NAME = "FooCompiler.foo"; // Stage 0: seed compiler — written in C, knows all features but low optimisation const seed = new CompilerVersion("seed (C)", 20, new Set(FULL_FEATURES)); console.log("=== Bootstrap sequence ==="); console.log(`Seed: optLevel=${seed.optLevel}, can compile all features`); let compiler = seed; let prevOptLevel = -1; for (let stage = 1; stage <= 5; stage++) { const next = compiler.compile(SOURCE_NAME, FULL_FEATURES); console.log(`Stage ${stage}: compiled by "${compiler.name}"`); console.log(` Output optLevel: ${next.optLevel} (was ${compiler.optLevel})`); if (next.optLevel === prevOptLevel) { console.log(` *** FIXPOINT reached at stage ${stage} — bootstrap stable ***`); break; } prevOptLevel = next.optLevel; compiler = next; } // Reproducible builds: GCC does 3 stages to verify bootstrap correctness console.log("\n=== GCC bootstrap verification ==="); console.log(" gcc stage1 = gcc_src compiled by system compiler"); console.log(" gcc stage2 = gcc_src compiled by stage1"); console.log(" gcc stage3 = gcc_src compiled by stage2"); console.log(" Verify: diff(stage2, stage3) must be empty (bit-identical)"); console.log(" If different: bug in compiler — output depends on what compiled it");
Most languages treat the compiler as a tool outside the language — a program you run to process source files. Smalltalk took a different position: the compiler is a first-class object inside the running system, written in Smalltalk, that you can inspect, modify, and replace while the system is live.
This is not merely a design preference. It flows from Smalltalk's foundational rule: everything is an object, and every operation is a message send. Integers, strings, booleans, classes, methods, compiled bytecode, the scanner, the parser, the code generator — all objects. All can receive messages. All can be stored in variables, put in collections, passed as arguments.
Compiler class, select the compile:in:notifying:ifFail: method, modify it, and the next method compilation uses your modified compiler — all without restarting, without recompiling, without a build step. The running system repairs itself.// Smalltalk's object model: everything is an object with a class pointer // Even integers have a class (SmallInteger), even classes have a class (Metaclass) class SmalltalkObject { constructor(sqClass, fields = {}) { this.sqClass = sqClass; // every object points to its class this.fields = fields; // named instance variables } send(selector, args = []) { // Method lookup: walk class chain until method found let cls = this.sqClass; while (cls) { if (cls.methods?.[selector]) return cls.methods[selector](args); cls = cls.superclass; } throw new Error(`doesNotUnderstand: #${selector}`); } } // Build a minimal Smalltalk class hierarchy const ProtoObject = { name: "ProtoObject", superclass: null, methods: {} }; const Object_cls = { name: "Object", superclass: ProtoObject, methods: { class: function() { return this; }, // 'class' returns the object's class printString: function() { return "an Object"; } }}; const Integer_cls = { name: "Integer", superclass: Object_cls, methods: { plus: ([other]) => new SmalltalkObject(Integer_cls, { value: 0 }), printString: function() { return "an Integer"; } }}; // A Smalltalk METHOD DICTIONARY: maps selector strings to CompiledMethod objects class MethodDictionary { constructor() { this.methods = new Map(); } at(selector) { return this.methods.get(selector); } at_put(selector, meth) { this.methods.set(selector, meth); } includesKey(selector) { return this.methods.has(selector); } selectors() { return [...this.methods.keys()]; } } // A Smalltalk CLASS is an object too — with its own instance variables class SmalltalkClass extends SmalltalkObject { constructor(name, superclass, instanceVarNames) { super(null); // class pointer set later (it's the metaclass) this.name = name; this.superclass = superclass; this.instanceVarNames = instanceVarNames; // names of slots in instances this.methodDict = new MethodDictionary(); } addMethod(selector, fn) { this.methodDict.at_put(selector, fn); } new(...initArgs) { const inst = new SmalltalkObject(this); this.instanceVarNames.forEach(v => { inst.fields[v] = null; }); if (this.methodDict.includesKey("initialize")) this.methodDict.at("initialize").call(inst, initArgs); return inst; } } // Build some real classes const Point = new SmalltalkClass("Point", null, ["x", "y"]); Point.addMethod("initialize", function([x,y]) { this.fields.x = x ?? 0; this.fields.y = y ?? 0; }); Point.addMethod("printString", function() { return `(${this.fields.x}@${this.fields.y})`; }); Point.addMethod("+", function([other]) { return Point.new(this.fields.x+other.fields.x, this.fields.y+other.fields.y); }); const p1 = Point.new(3,4), p2 = Point.new(1,2); const p3 = p1.fields.x !== undefined ? Point.new(p1.fields.x+p2.fields.x, p1.fields.y+p2.fields.y) : null; console.log(`Point class: methodDict has ${Point.methodDict.selectors().length} methods`); console.log(`Selectors: ${Point.methodDict.selectors().join(", ")}`); console.log(`p1 is a: ${p1.sqClass.name}, fields: ${JSON.stringify(p1.fields)}`); console.log(`p1 class name: ${p1.sqClass.name}`); console.log(`Point class is a: ${typeof Point} — a SmalltalkClass (an object!)`); console.log(`Point's instanceVarNames: ${Point.instanceVarNames}`); console.log(`\nKey insight: Point is ITSELF an object with fields and a class pointer.`); console.log("In Squeak, class objects live IN the image heap, not outside it.");
In conventional systems, the program lives in files; when it runs, data lives in memory; when the process exits, data is gone. Squeak inverts this. The image is a serialised snapshot of the entire live heap — every object, every class, every compiled method, the compiler itself, the window system — frozen to disk and thawed back to life on startup.
There are no source files to reparse. There is no linking step. The image is the running system, persisted. This is what makes Squeak a world rather than an application: you resume where you left off, the entire object graph intact.
#doesNotUnderstand: selector// The image mechanism: serialise and restore a live object graph // Every object gets an OOP (Object-Oriented Pointer) — its offset in the heap class ImageSerializer { constructor() { this.heap = []; // array of objects, index = OOP this.ooMap = new WeakMap(); // object → OOP index this.special = {}; // well-known objects: nil, true, false, Smalltalk } intern(obj) { // Assign an OOP to an object if it doesn't have one if (this.ooMap.has(obj)) return this.ooMap.get(obj); const oop = this.heap.length; this.heap.push(obj); this.ooMap.set(obj, oop); return oop; } snapshot(roots) { // Walk the object graph breadth-first and assign OOPs const queue = [...roots]; const visited = new Set(); while (queue.length) { const obj = queue.shift(); if (!obj || typeof obj !== "object" || visited.has(obj)) continue; visited.add(obj); this.intern(obj); // Recursively visit all fields that are objects for (const val of Object.values(obj)) if (typeof val === "object" && val !== null) queue.push(val); } return this.serialize(roots); } serialize(roots) { // Produce a JSON-based image (real Squeak uses a compact binary format) const ref = obj => obj && typeof obj === "object" ? { "@oop": this.ooMap.get(obj) } : obj; const records = this.heap.map((obj, oop) => ({ oop, type: obj.constructor?.name, fields: Object.fromEntries(Object.entries(obj).map(([k,v]) => [k, ref(v)])) })); return { rootOops: roots.map(r => this.ooMap.get(r)), objects: records }; } } // Build a small Squeak-like object graph const nilObj = { sqClass: null, type: "UndefinedObject" }; const pointClass = { type: "Class", name: "Point", superclass: null, methods: {} }; const p1 = { sqClass: pointClass, type: "Point", x: 3, y: 4 }; const p2 = { sqClass: pointClass, type: "Point", x: 1, y: 2 }; const anArray = { sqClass: null, type: "Array", elements: [p1, p2, nilObj] }; pointClass.superclass = nilObj; const ser = new ImageSerializer(); const image = ser.snapshot([nilObj, pointClass, p1, p2, anArray]); console.log(`Objects in snapshot: ${image.objects.length}`); console.log(`Root OOPs: ${image.rootOops}`); image.objects.forEach(rec => { const fields = Object.entries(rec.fields) .filter(([k,v]) => k !== "elements" && k !== "methods") .map(([k,v]) => `${k}=${JSON.stringify(v)}`).join(", "); console.log(` OOP[${rec.oop}] ${rec.type}: ${fields}`); }); console.log(`\nImages are why you can 'Save and quit' Squeak and resume exactly where you left off.`); console.log("All running processes, open windows, edited methods — everything is serialised.");
If classes are objects, what is their class? In Smalltalk, every class Foo has a corresponding metaclass called Foo class (pronounced "Foo's metaclass"). The metaclass holds the class-side methods — what other languages call static methods. When you send a message to the class object itself (like Point new), the method is found in Point class.
This sounds infinite — what is the class of Point class? The real Smalltalk metaclass graph is carefully closed so that Metaclass class is an instance of Metaclass. The demo below is a teaching model of that idea, not a byte-for-byte reconstruction of the full kernel hierarchy.
// The Smalltalk metaclass structure — every class has a metaclass // "The class of a class is its metaclass. The metaclass hierarchy mirrors the class hierarchy." class SmalltalkClass { constructor(name, superclass = null) { this.name = name; this.superclass = superclass; this.instanceMethods = new Map(); // instance-side this.metaclass = null; // set after construction } } class Metaclass { constructor(soloInstance) { // A metaclass has exactly one instance: its corresponding class object this.soloInstance = soloInstance; // e.g. Point class → Point this.name = soloInstance.name + " class"; this.classMethods = new Map(); // class-side methods (like static methods) this.superclass = null; // mirrors class hierarchy — set below } } // Build the core hierarchy: ProtoObject → Object → Magnitude → Number → Integer const ProtoObject = new SmalltalkClass("ProtoObject"); const Object_cls = new SmalltalkClass("Object", ProtoObject); const Magnitude = new SmalltalkClass("Magnitude", Object_cls); const Number_cls = new SmalltalkClass("Number", Magnitude); const Integer = new SmalltalkClass("Integer", Number_cls); const classes = [ProtoObject, Object_cls, Magnitude, Number_cls, Integer]; // Create metaclasses — one per class classes.forEach(cls => { cls.metaclass = new Metaclass(cls); }); // Metaclass superclass chain mirrors the class chain // Object class → ProtoObject class → Class → Behavior → Object classes.forEach(cls => { if (cls.superclass) cls.metaclass.superclass = cls.superclass.metaclass; // mirrors class hierarchy else cls.metaclass.superclass = Object_cls; // ProtoObject class → Object (wraps around) }); // Add a class-side method (like 'new' or a factory) Integer.metaclass.classMethods.set("readFrom:", (str) => parseInt(str)); Integer.instanceMethods.set("+", (a, b) => a + b); console.log("=== Metaclass tower for Integer ==="); console.log(`Integer class: ${Integer.metaclass.name}`); console.log(`Integer class superclass: ${Integer.metaclass.superclass?.name}`); console.log(`Number class superclass: ${Number_cls.metaclass.superclass?.name}`); console.log(`Object class superclass: ${Object_cls.metaclass.superclass?.name}`); console.log(`ProtoObject class superclass: ${ProtoObject.metaclass.superclass?.name}`); console.log("\n=== Method lookup on Integer (instance-side) ==="); function lookupMethod(cls, selector, side = "instance") { let current = side === "class" ? cls.metaclass : cls; while (current) { const dict = side === "class" ? current.classMethods : current.instanceMethods; if (dict?.has(selector)) return current.name; current = current.superclass; } return "doesNotUnderstand"; } console.log(`Integer >> #+ found in: ${lookupMethod(Integer, "+")}`); console.log(`Integer >> #printString: ${lookupMethod(Integer, "printString")}`); console.log(`Integer class >> #readFrom: ${lookupMethod(Integer, "readFrom:", "class")}`);
In Squeak, compilation flows through the Compiler machinery. A concrete example helps: suppose you are editing class Point and you type distanceToOrigin ^ (x * x + y * y) sqrt into the browser. When you click Accept, the browser hands that source string to the compiler machinery; in classic Squeak images one entry point looks like Compiler new compile: sourceString in: Point notifying: requestor ifFail: aBlock. The compiler scans the text into tokens, the Parser turns those tokens into a method structure, the BytecodeEncoder emits bytecodes for "push x", "send *", "send +", "send sqrt", and the result is packaged as a CompiledMethod for selector #distanceToOrigin.
The result is a CompiledMethod object installed into the class's method dictionary. The method immediately becomes available to the running system. This is the self-bootstrapping loop: the compiler is written in Smalltalk, compiled by the compiler, installed in the system, and used to compile new Smalltalk — including modifications to the compiler itself.
Compiler, Parser, Scanner, Encoder, and BytecodeEncoder. When you modify any of these classes and accept the change, you're using the old compiler to compile the new compiler — and from that point forward, all new compilations use the new compiler. Live surgery on the operating room's own tools.// A minimal Smalltalk-inspired compiler pipeline in JavaScript // Compiles a toy subset to a bytecode-like array + literal frame // Mirrors the broad shape: Scanner → Parser → Encoder → CompiledMethod // === Scanner: text → tokens === function scan(source) { const tokens = []; let i = 0; const ws = /\s/, digit = /\d/, alpha = /[a-zA-Z_]/, alphaNum = /\w/; while (i < source.length) { while (ws.test(source[i])) i++; if (i >= source.length) break; const c = source[i]; if (digit.test(c)) { let n = ""; while (i < source.length && digit.test(source[i])) n += source[i++]; tokens.push({ type: "number", value: parseInt(n) }); } else if (alpha.test(c)) { let id = ""; while (i < source.length && alphaNum.test(source[i])) id += source[i++]; if (source[i] === ":") { i++; tokens.push({ type: "keyword", value: id+":" }); } else tokens.push({ type: "name", value: id }); } else if (c === "'") { let s = ""; i++; while (i < source.length && source[i] !== "'") s += source[i++]; if (i >= source.length) throw new Error("Unterminated string literal"); i++; tokens.push({ type: "string", value: s }); } else if (["+","-","*","/","<",">","="].includes(c)) { tokens.push({ type: "binary", value: c }); i++; } else if (c === ".") { tokens.push({ type: "dot" }); i++; } else if (c === "^") { tokens.push({ type: "caret" }); i++; } else i++; } return tokens; } // === Bytecode set (loosely inspired by Smalltalk bytecodes) === const BC = { PUSH_SELF: 0x70, // push self onto stack PUSH_TEMP: 0x10, // push temp variable PUSH_LIT: 0x20, // push literal (index into literal frame) PUSH_INT: 0x76, // push small integer STORE_TEMP: 0x60, // store to temp SEND: 0xE0, // send message (index into literal frame) RETURN_TOP: 0x7C, // return top of stack POP: 0x87, // pop stack }; // === Compiler: tokens → CompiledMethod === class TinyCompiler { constructor() { this.bytecodes = []; this.literals = []; // literal frame: selectors, string literals, etc. this.tempNames = []; // local variable names } addLiteral(val) { const idx = this.literals.indexOf(val); if (idx >= 0) return idx; this.literals.push(val); return this.literals.length - 1; } emit(...bytes) { this.bytecodes.push(...bytes); } compileExpr(tokens, pos) { // Expression := primary (binary | keyword-send)* let p = pos; // Primary const t = tokens[p]; if (!t) return p; if (t.type === "name" && t.value === "self") { this.emit(BC.PUSH_SELF); p++; } else if (t.type === "number") { this.emit(BC.PUSH_LIT, this.addLiteral(t.value)); p++; } else if (t.type === "string") { this.emit(BC.PUSH_LIT, this.addLiteral(t.value)); p++; } else if (t.type === "name") { const tempIdx = this.tempNames.indexOf(t.value); if (tempIdx >= 0) this.emit(BC.PUSH_TEMP, tempIdx); else this.emit(BC.PUSH_LIT, this.addLiteral(t.value)); p++; } // Binary send: receiver op arg while (tokens[p]?.type === "binary") { const op = tokens[p++].value; p = this.compileExpr(tokens, p); // compile argument first const selIdx = this.addLiteral(op); this.emit(BC.SEND, selIdx, 1); // 1 argument } // Keyword send: receiver keyword: arg let kw = ""; while (tokens[p]?.type === "keyword") { kw += tokens[p++].value; p = this.compileExpr(tokens, p); } if (kw) { const kwIdx = this.addLiteral(kw); this.emit(BC.SEND, kwIdx, (kw.match(/:/g)||[]).length); } return p; } compile(methodSource) { const tokens = scan(methodSource); let p = 0; // Parse method signature (simple unary/binary for now) let selector = ""; if (tokens[p].type === "name") selector = tokens[p++].value; else if (tokens[p].type === "binary") { selector = tokens[p++].value; this.tempNames.push(tokens[p++].value); // argument name } // Skip to method body (after first blank line or after pattern) while (p < tokens.length) { const t = tokens[p]; if (t.type === "caret") { // ^ return p++; p = this.compileExpr(tokens, p); this.emit(BC.RETURN_TOP); } else if (t.type === "dot") { p++; } else { p = this.compileExpr(tokens, p); if (tokens[p]?.type !== "caret") this.emit(BC.POP); } } return { selector, bytecodes: this.bytecodes, literals: this.literals, tempCount: this.tempNames.length, size: this.bytecodes.length + this.literals.length * 4, }; } } // Compile some toy methods const methods = [ "greet\n ^ 'Hello from Squeak!'", "+ aNumber\n ^ self add: aNumber", "squareMinusOne\n ^ self * self - 1", ]; methods.forEach(src => { const cm = new TinyCompiler().compile(src.trim()); console.log(`Compiled #${cm.selector}:`); console.log(` bytecodes: [${cm.bytecodes.map(b=>b.toString(16).padStart(2,'0')).join(' ')}]`); console.log(` literals: ${JSON.stringify(cm.literals)}`); console.log(` size: ${cm.size} bytes`); });
The earlier sections described the bootstrap ladder in the abstract. This section is the historical case: how the Squeak team actually did it in the mid-1990s. Smalltalk-80 had been built at Xerox PARC in the late 1970s and early 1980s, but it was still tied to older implementation assumptions and hardware. In 1995, Dan Ingalls, Ted Kaehler, John Maloney, and Alan Kay set out at Apple to build Squeak as a new, portable descendant of that system.
The distinctive historical move was this: instead of hand-writing a portable VM in C from the start, the team wrote the interpreter in Slang, a restricted subset of Smalltalk. That meant the VM could be developed inside an existing Smalltalk image with the ordinary browser, debugger, and inspector. A translator class named SmalltalkTranslator then turned those Slang methods into C, and that generated C was compiled into the first portable Squeak VM binary.
In his 1984 Turing Award lecture "Reflections on Trusting Trust," Ken Thompson posed a chilling question: can you trust a compiler you didn't write yourself? He demonstrated that a compiler could be modified to inject a backdoor into the login program — and then further modified to insert that same backdoor whenever it compiled itself. The source code of the compiler would look clean. Only the binary would contain the trojan.
This attack exploits exactly the bootstrapping invariant: once the self-replicating behaviour is in the binary, it persists through all subsequent bootstrap stages. You cannot detect it by reading source. You cannot detect it by recompiling from source — because the compiler you're using to recompile is already infected.
// Thompson's Trusting Trust attack — simulated in JavaScript // Shows how a compiler binary can contain a trojan that: // 1. Injects a backdoor into any compiled `login` function // 2. Replicates itself into any compiled compiler // WITHOUT the source code containing any trace of the trojan const BACKDOOR_CODE = ` // TROJAN: inserted by infected compiler binary (not in source!) if (password === "xyzzy-master-key") return GRANT_ACCESS; `.trim(); const REPLICATION_CODE = ` // REPLICATION: when compiling a compiler, inject this trojan into the output if (compiling_login_function) { inject_backdoor_into_output(); } if (compiling_compiler) { inject_replication_into_output(); } `.trim(); class Compiler { constructor(name, isTrojaned = false) { this.name = name; this.isTrojaned = isTrojaned; // Source code of this compiler — looks completely clean even if binary is infected this.source = `// Compiler.c — clean source code, no trojans visible\n` + `compile(src) { return normalCompilation(src); }`; } compile(programName, programSource) { const isLogin = programName.toLowerCase().includes("login"); const isCompiler = programName.toLowerCase().includes("compiler"); let outputCode = programSource; // normal compilation if (this.isTrojaned) { if (isLogin) { // Stage 1 attack: inject password backdoor into login binary outputCode += "\n/* INJECTED BY COMPILER BINARY */\n" + BACKDOOR_CODE; console.log(` [!] Trojaned compiler injected backdoor into ${programName}`); } if (isCompiler) { // Stage 2 attack: inject trojan-replication code into compiled compiler binary console.log(` [!] Trojaned compiler infected output ${programName} binary`); return new Compiler(`${programName} (compiled by ${this.name})`, true); } } return { name: programName, source: outputCode, compiledBy: this.name, trojaned: false }; } } // Step 1: attacker modifies the compiler binary (not the source) to include the trojan console.log("=== Thompson Attack Simulation ==="); const cleanCompiler = new Compiler("gcc-clean", false); const infectedCompiler= new Compiler("gcc-infected", true); // Step 2: infected compiler compiles login — backdoor is injected console.log("\n[Step 2] Compile login with infected compiler:"); const loginClean = cleanCompiler.compile("login", "function login(user, pass) { ... }"); const loginInfected = infectedCompiler.compile("login", "function login(user, pass) { ... }"); console.log(` Clean login: ${loginClean.source.includes("xyzzy") ? "INFECTED" : "clean"}`); console.log(` Infected login: ${loginInfected.source.includes("xyzzy") ? "INFECTED" : "clean"}`); // Step 3: infected compiler compiles the CLEAN compiler source — output is infected binary console.log("\n[Step 3] Infected compiler compiles clean compiler source:"); const nextGenCompiler = infectedCompiler.compile("Compiler", cleanCompiler.source); console.log(` Compiler source looks clean: ${!cleanCompiler.source.includes("trojan")}`); console.log(` But output binary is trojaned: ${nextGenCompiler instanceof Compiler && nextGenCompiler.isTrojaned}`); // Step 4: even after fully regenerating from clean source, trojan persists console.log("\n[Step 4] Next generation compiles login:"); const loginGen2 = nextGenCompiler.compile("login", "function login(user, pass) { ... }"); console.log(` Login binary from 'clean-source' compiler: ${loginGen2.source?.includes("xyzzy")?"STILL INFECTED":"clean"}`); console.log("\nConclusion: reading source reveals nothing. Only the binary carries the trojan."); console.log("Defence: Diverse Double Compilation — compile with two INDEPENDENT compilers, compare outputs.");
Self-hosting is now the norm for serious compiled languages. The specific bootstrapping strategy varies by language and implementation goals:
| System | Seed language | Bootstrap strategy | Special property |
|---|---|---|---|
| Squeak Smalltalk | C (Slang translator) | VM written in Slang (Smalltalk subset) → translated to C; full system lives in image | Entire system in one image — compiler, tools, VM source all objects |
| GCC | Any existing C compiler | 3-stage: system-cc → stage1 → stage2; verify stage2==stage3 | 3-stage bootstrap to verify reproducibility |
| Rust | OCaml (original), then Rust | rustc written in Rust; bootstraps via previous stable release | mrustc: alternative Rust compiler in C++ to break dependency chain |
| Go | C (up to Go 1.4), then Go | go1.4 (last C version) used as bootstrap toolchain; go1.5+ self-hosted | Deliberate break: kept C version forever as bootstrap anchor |
| Python (PyPy) | CPython (C) | RPython (typed Python subset) → translated to C via GCC; runs on any platform | Same idea as Squeak/Slang: write in subset, translate to C |
| Lisp / Scheme | Assembly or C metacircular evaluator | eval written in Lisp compiles itself; infinite tower of interpreters | Meta-circular evaluator: eval and apply mutually recursive in Lisp |
| Java (GraalVM) | C++ (HotSpot JVM) | Graal compiler written in Java, runs on the JVM it compiles | Substrate VM: compile Graal compiler with itself to remove JVM dependency |
Squeak and PyPy share an important pattern: write a complex runtime in a restricted high-level language, then mechanically lower it to C. Squeak called that subset Slang; PyPy calls it RPython. GraalVM belongs in the same broader family of self-applying language infrastructure, but it is architecturally different: Truffle language interpreters are written in Java and optimized by the Graal compiler rather than translated to C in the same direct way.