Appendix A — Mathematical Notation Reference
A.1 Why This Appendix Exists
Many programmers are comfortable with code but less comfortable with compact mathematical notation. That is normal. The purpose of this appendix is not to teach advanced mathematics. It is to make the notation in this book easy to parse at a glance.
When you see a formula in the chapters, use this appendix as a decoder ring.
A.2 The Basic Objects
A.2.1 Variables and Values
- x, y, z usually denote specific values or outcomes.
- X, Y, Z usually denote random variables.
- If you see X = x, read it as: “the random variable X took the value x.”
Example:
P(X = 1) = 0.25
Read this as: “the probability that X equals 1 is 0.25.”
A.2.2 Distributions
- P and Q usually denote probability distributions.
- p(x) means “the probability assigned by distribution P to value x.”
- q(x) means the same thing for distribution Q.
In discrete settings, these are interchangeable:
P(X = x)
p(x)
Both mean “the probability of outcome x.”
A.2.3 Samples
- x ~ P means “x is drawn from distribution P.”
- x₁, x₂, ..., xₙ means a sequence of samples.
If subscripts are annoying, read x_i as “x sub i” or just “the i-th x.”
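In Python, “x ~ P” for a discrete distribution is often a weighted random choice. A minimal sketch, using an invented three-outcome distribution:

```python
import random

# A hypothetical discrete distribution P over three outcomes.
P = {"a": 0.5, "b": 0.25, "c": 0.25}

# x ~ P: draw one sample from P.
x = random.choices(list(P), weights=P.values(), k=1)[0]

# x_1, ..., x_n: a sequence of n samples from P.
samples = random.choices(list(P), weights=P.values(), k=10)
```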
A.3 Sums, Products, and Indexing
A.3.1 Summation
∑ p(x) log₂ p(x)
Read the ∑ as: “sum over all possible values of x.” The whole expression adds up p(x) log₂ p(x) across every outcome.
In Python, this is usually:
sum(p[x] * math.log2(p[x]) for x in p)
A.3.2 Product
∏ p(x_i)
Read this as: “multiply these terms together.”
Products show up when independent probabilities combine.
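For instance, if several independent events each have their own probability, the probability that all of them occur is the product. The values below are illustrative:

```python
import math

# Probabilities of three independent events (made-up values).
probs = [0.5, 0.25, 0.5]

# ∏ p(x_i): the probability that all three occur together.
joint = math.prod(probs)
```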
A.3.3 Indices
- x_i means the i-th element of a sequence.
- P_i means the i-th probability.
- i = 1, ..., n means “i runs from 1 through n.”
A.4 Logs and Units
A.4.1 Logarithms
- log₂(x) means base-2 logarithm.
- ln(x) means natural logarithm, base e.
- log₁₀(x) means base-10 logarithm.
This book mostly uses log₂, because information in base 2 is measured in bits.
A.4.2 Bits, Nats, and Bans
- bits: base-2 logarithms
- nats: natural logarithms
- bans or hartleys: base-10 logarithms
Conversions:
1 nat = 1 / ln(2) ≈ 1.4427 bits
1 bit = ln(2) ≈ 0.6931 nats
1 ban = log₂(10) ≈ 3.3219 bits
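These conversions are just changes of logarithm base, so they reduce to one-line functions:

```python
import math

def nats_to_bits(n):
    # 1 nat = 1 / ln(2) ≈ 1.4427 bits
    return n / math.log(2)

def bits_to_nats(b):
    # 1 bit = ln(2) ≈ 0.6931 nats
    return b * math.log(2)

def bans_to_bits(b):
    # 1 ban = log2(10) ≈ 3.3219 bits
    return b * math.log2(10)
```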
A.5 Probability Notation
A.5.1 Joint Probability
P(X = x, Y = y)
This means the probability that X = x and Y = y happen together.
Shorthand:
p(x, y)
A.5.2 Conditional Probability
P(Y = y | X = x)
Read this as: “the probability that Y = y given that X = x.”
The vertical bar | always means “given.”
A.5.3 Independence
P(X, Y) = P(X)P(Y)
This means X and Y are independent: knowing one tells you nothing about the other.
A.6 Expectation and Averages
A.6.1 Expected Value
E[X]
Read this as: “the expected value of X.”
For a discrete variable:
E[X] = ∑ x P(X = x)
This is a probability-weighted average.
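A fair six-sided die makes this concrete: each face has probability 1/6, and the probability-weighted average of the faces is 3.5.

```python
# E[X] = sum over x of x * P(X = x), for a fair six-sided die.
P = {x: 1 / 6 for x in range(1, 7)}

# Probability-weighted average of the outcomes: 3.5 for a fair die.
expected = sum(x * p for x, p in P.items())
```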
A.6.2 Expected Value of a Function
E[f(X)]
This means: apply f to X, then average with respect to the distribution of X.
Example:
H(X) = E[-log₂ P(X)]
Entropy is the expected surprise.
A.7 The Core Information-Theoretic Quantities
A.7.1 Surprise
I(x) = -log₂ p(x)
The information content, or surprise, of outcome x.
A.7.2 Entropy
H(X) = -∑ p(x) log₂ p(x)
Average surprise of a random variable.
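The formula translates almost directly into code. One sketch, skipping zero-probability outcomes since 0 log 0 is taken as 0:

```python
import math

def entropy(p):
    """H(X) = -sum p(x) log2 p(x), over outcomes with nonzero probability."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

# A fair coin has exactly 1 bit of entropy.
h = entropy({"heads": 0.5, "tails": 0.5})
```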
A.7.3 Joint Entropy
H(X, Y)
Uncertainty in the pair (X, Y).
A.7.4 Conditional Entropy
H(Y | X)
The uncertainty remaining in Y after X is known.
Chain rule:
H(X, Y) = H(X) + H(Y | X)
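The chain rule can be checked numerically. The joint distribution below is invented for illustration:

```python
import math

def H(dist):
    # Entropy of a distribution given as {outcome: probability}.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A made-up joint distribution over pairs (x, y).
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5}

# Marginal P(X), obtained by summing out y.
px = {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p

# By the chain rule, H(Y | X) = H(X, Y) - H(X).
h_y_given_x = H(joint) - H(px)
```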
A.7.5 Cross-Entropy
H(P, Q) = -∑ p(x) log₂ q(x)
The average coding cost when the truth is P but you encode using Q.
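A minimal sketch, assuming q(x) > 0 wherever p(x) > 0 (the distributions here are made up):

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum p(x) log2 q(x); assumes q(x) > 0 where p(x) > 0."""
    return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.5}    # the true distribution
q = {"a": 0.25, "b": 0.75}  # the model used for encoding

# Encoding with the wrong distribution never costs less than H(P).
ce = cross_entropy(p, q)
```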
A.7.6 KL Divergence
KL(P || Q) = ∑ p(x) log₂ [p(x) / q(x)]
The extra cost of using Q when the truth is P.
Read || as “relative to” or “compared to.”
Important: KL is not symmetric.
KL(P || Q) ≠ KL(Q || P)
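The asymmetry is easy to see with two concrete distributions (invented for illustration):

```python
import math

def kl(p, q):
    """KL(P || Q) = sum p(x) log2 [p(x) / q(x)]; assumes q(x) > 0 where p(x) > 0."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "b": 0.1}

# The two directions generally disagree: KL is not symmetric.
forward = kl(p, q)
backward = kl(q, p)
```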
A.7.7 Mutual Information
I(X; Y)
Read this as: “the mutual information between X and Y.”
Definitions:
I(X;Y) = H(X) - H(X|Y)
= H(Y) - H(Y|X)
= H(X) + H(Y) - H(X, Y)
= KL(P(X,Y) || P(X)P(Y))
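The identity I(X;Y) = H(X) + H(Y) - H(X, Y) can be checked numerically; the joint distribution below is a toy example:

```python
import math

def H(dist):
    # Entropy of a distribution given as {outcome: probability}.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A toy joint distribution where Y tends to copy X.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals P(X) and P(Y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

# I(X;Y) = H(X) + H(Y) - H(X, Y)
mi = H(px) + H(py) - H(joint)
```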
A.7.8 Binary Entropy Function
H_b(p) = -p log₂ p - (1-p) log₂(1-p)
This is the entropy of a Bernoulli random variable: success with probability p, failure with probability 1-p.
It appears constantly in coding theory and channel capacity.
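A sketch of H_b, handling the edge cases p = 0 and p = 1 where the entropy is taken to be 0:

```python
import math

def binary_entropy(p):
    """H_b(p) = -p log2 p - (1-p) log2 (1-p), with H_b(0) = H_b(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```

H_b peaks at 1 bit when p = 0.5 and is symmetric around that point.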
A.8 Optimization Notation
A.8.1 Maximum and Minimum
max_x f(x)
min_x f(x)
The largest or smallest value of f(x) as x varies.
A.8.2 Argmax and Argmin
argmax_x f(x)
argmin_x f(x)
These return the value of x that achieves the maximum or minimum.
Example:
C = max_{P(X)} I(X;Y)
Channel capacity is the maximum mutual information over all input distributions.
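Over a finite set of candidates, max and argmax are one-liners in Python. The function f here is a hypothetical example peaking at x = 3:

```python
# argmax over a finite set: the x that makes f(x) largest.
def f(x):
    return -(x - 3) ** 2  # a hypothetical function peaking at x = 3

candidates = range(10)

best_x = max(candidates, key=f)             # argmax_x f(x)
best_value = max(f(x) for x in candidates)  # max_x f(x)
```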
A.8.3 Subject To
maximize f(x)
subject to g(x) ≤ c
This means “optimize f(x) while obeying the constraint.”
A.9 Approximation and Asymptotics
A.9.1 Approximately Equal
≈
Read as “approximately equal.”
A.9.2 Proportional To
∝
Read as “proportional to.”
A.9.3 Goes To
n → ∞
Read as “n goes to infinity.”
A.9.4 Big-O
O(n log n)
Asymptotic growth rate. This shows up rarely in the book, but when it does, it means “grows on the order of.”
A.10 Common Greek Letters
- μ (mu): often a mean
- σ (sigma): often a standard deviation
- σ²: variance
- θ (theta): often a model parameter
- ε (epsilon): often a small error rate or tolerance
- δ (delta): often a small change
- λ (lambda): often a rate, eigenvalue, or regularization parameter
- ρ (rho): often a correlation coefficient
- τ (tau): often a threshold or temperature parameter
There is no universal law here. Greek letters are conventions, not magic.
A.11 Reading Formulas in Plain English
Here are a few formulas from the book, translated directly.
H(X) = -∑ p(x) log₂ p(x)
“Entropy is the negative sum, over all outcomes, of probability times log probability.”
KL(P || Q) = H(P, Q) - H(P)
“KL divergence is the extra coding cost of using Q instead of the true distribution P.”
I(X;Y) = H(X) - H(X|Y)
“Mutual information is how much uncertainty in X disappears when you learn Y.”
C = max_{P(X)} I(X;Y)
“Capacity is the most information the channel can carry, after choosing the best input distribution.”
A.12 Final Advice
Do not try to memorize notation by force.
Instead, whenever you see a symbol:
- Ask what kind of object it is: value, random variable, function, or distribution.
- Ask whether the formula is summing, averaging, conditioning, or optimizing.
- Translate it into one plain-English sentence.
If you can do that, the notation stops being a wall and becomes compression for ideas, which is exactly what mathematical notation is supposed to be.