Modern Software Tools, Explained
Approximate Answers

Good Enough Answers - Probabilistic Data Structures

Sometimes exactness is the wrong objective. If you need to ask whether a URL is in a set of ten billion URLs, or estimate the number of distinct users in a stream, the winning move is often to spend a tiny amount of memory on an answer that is almost always right and explicitly bounded when it is wrong.

Bloom filters for membership
Count-min sketch for frequencies
HyperLogLog for distinct counts
Error as a design parameter
§01

The memory wall

Suppose you want to ask a simple question: does this URL exist in a set of ten billion URLs? A fully exact hash set may be fine on paper, but not when you only have about 1GB of RAM and still need room for everything else the program does.

This is where probabilistic data structures enter. They answer narrower questions with radically less memory, and they do so by being honest about the shape of their errors. Bloom filters allow false positives but not false negatives. Count-min sketch overestimates counts but never underestimates them. HyperLogLog estimates cardinality with a known relative error. They do not make every scale point magically fit, but they often make staged designs practical where exact storage is hopeless.

The key engineering move is not "be sloppy." It is "choose a cheap kind of wrongness that the system can tolerate." At 10B URLs and 1GB RAM, that likely means an approximate local stage plus a second check, not one giant exact in-memory set.
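That staged shape is easy to sketch. The names below (`makeStagedLookup`, `approxHas`, `exactHas`) are illustrative stand-ins for a Bloom-filter probe and an authoritative store, not an API from the article:

```javascript
// Hypothetical staged membership check: a compact approximate filter answers
// most lookups in memory; only its rare "probably present" answers pay for an
// exact (e.g. on-disk) confirmation.
function makeStagedLookup(approxHas, exactHas) {
  return function seen(url) {
    if (!approxHas(url)) return false; // a Bloom-style "no" is definitive: skip the exact stage
    return exactHas(url);              // confirm the occasional false positive
  };
}

// Toy stand-ins: the approximate stage wrongly reports "c" as present.
const approxSet = new Set(["a", "b", "c"]);
const exactSet = new Set(["a", "b"]);
const seen = makeStagedLookup((u) => approxSet.has(u), (u) => exactSet.has(u));
```

Here `seen("x")` returns false without consulting the exact store at all, while `seen("c")` pays one exact lookup to catch the false positive.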
Exact set versus compact summary

Exact: store every key. Memory scales with the number of elements, plus allocator overhead, pointers, and metadata.

Probabilistic: store evidence, not the elements. Memory scales with a target error budget: bits, counters, or registers instead of the raw key set.

The running example: 10B URLs, a 1GB RAM budget, a membership question, and a likely move of "approximate filter plus staged exact check."
§02

Bloom filters

A Bloom filter is a bit array plus k hash functions. To insert a key, hash it k ways and set those bit positions to 1. To query a key, hash it the same ways and check those positions. If any bit is 0, the key is definitely absent. If all are 1, the key is probably present.

The false-positive rate is not hand-waving. For m bits, n inserted items, and k hashes, the standard approximation is (1 - e^(-kn/m))^k. That lets you size the filter from the workload instead of guessing.

Bloom filters buy tiny memory and fast negative lookups. In exchange, they cannot list contents, cannot count frequency, and cannot delete cleanly without upgrading to a counting variant.
Target a false-positive rate: from the expected item count and the target rate you can derive the bits needed, the number of hash functions, and the memory per item, then check the observed false-positive rate against a sample workload.
bloom_filter.js - membership with explicit false-positive control
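A minimal sketch of such a filter, sized from the formula above. The class and hash choices here are illustrative, not a canonical implementation; it uses double hashing (two base hashes combined as h1 + i·h2) in place of k independent hash functions:

```javascript
// Illustrative Bloom filter (string keys). Sizing uses the standard formulas:
// m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions
// for n expected items at target false-positive rate p.
class BloomFilter {
  constructor(expectedItems, falsePositiveRate) {
    this.m = Math.ceil(
      (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2)
    );
    this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.LN2));
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  // Double hashing: k positions derived as h1 + i * h2 from two base hashes
  // (FNV-1a and djb2 here) instead of k independent hash functions.
  positions(key) {
    let h1 = 2166136261;
    let h2 = 5381;
    for (const ch of String(key)) {
      const c = ch.charCodeAt(0);
      h1 = Math.imul(h1 ^ c, 16777619) >>> 0;
      h2 = (Math.imul(h2, 33) + c) >>> 0;
    }
    h2 = (h2 | 1) >>> 0; // force an odd step so positions spread over the whole array
    const out = [];
    for (let i = 0; i < this.k; i++) out.push((h1 + i * h2) % this.m);
    return out;
  }

  add(key) {
    for (const p of this.positions(key)) this.bits[p >> 3] |= 1 << (p & 7);
  }

  // False negatives are impossible; false positives occur at roughly the target rate.
  has(key) {
    return this.positions(key).every((p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

// 1,000 expected URLs at a 1% target false-positive rate.
const bf = new BloomFilter(1000, 0.01);
for (let i = 0; i < 1000; i++) bf.add(`url-${i}`);
```

Plugging in the formula, 1,000 items at 1% yields m = 9,586 bits (about 1.2 KB) and k = 7 hashes: the filter is sized from the workload, exactly as the text describes.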

      
§03

Count-min sketch

Count-min sketch is for a different question: how often did this item appear in a stream? It keeps a small table of counters arranged in rows and columns. Each row has its own hash function. Updating an item increments one counter per row. Querying an item returns the minimum of those counters.

Why the minimum? Collisions only push counts upward, never downward. Taking the minimum across rows gives the least inflated answer available. The knobs are usually framed as epsilon for additive error and delta for failure probability, which translate to width and depth.

Count-min sketch is not trying to know everything. It is trying to know enough to spot heavy hitters and approximate frequencies in streams that are too large to store exactly.
Frequency estimation from a stream: choosing a width and depth fixes the maximum additive error and the failure probability.
count_min_sketch.js - streaming counts with error controls
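A compact sketch of the structure, with epsilon and delta mapped to width and depth in the standard way (width = ⌈e/ε⌉, depth = ⌈ln(1/δ)⌉). Class name and hash choice are illustrative:

```javascript
// Illustrative count-min sketch. width bounds the additive error by
// epsilon * N (N = total stream count); depth bounds the failure
// probability of that guarantee by delta.
class CountMinSketch {
  constructor(epsilon, delta) {
    this.width = Math.ceil(Math.E / epsilon);
    this.depth = Math.ceil(Math.log(1 / delta));
    this.rows = Array.from({ length: this.depth }, () => new Uint32Array(this.width));
    // A distinct seed per row gives each row its own hash function.
    this.seeds = Array.from(
      { length: this.depth },
      (_, i) => (0x9e3779b9 ^ Math.imul(i + 1, 0x85ebca6b)) >>> 0
    );
  }

  hash(key, seed) {
    let h = seed;
    for (const ch of String(key)) {
      h = Math.imul(h ^ ch.charCodeAt(0), 2654435761) >>> 0;
    }
    return h % this.width;
  }

  update(key, count = 1) {
    for (let r = 0; r < this.depth; r++) {
      this.rows[r][this.hash(key, this.seeds[r])] += count;
    }
  }

  // Collisions only inflate counters, so the minimum across rows is the
  // least-inflated estimate available -- never below the true count.
  query(key) {
    let min = Infinity;
    for (let r = 0; r < this.depth; r++) {
      min = Math.min(min, this.rows[r][this.hash(key, this.seeds[r])]);
    }
    return min;
  }
}

// epsilon = 0.01, delta = 0.01 -> a 272-wide, 5-deep table of counters.
const cms = new CountMinSketch(0.01, 0.01);
cms.update("apple", 5);
cms.update("banana", 3);
```

Note the asymmetry the text describes: `query` can overestimate when rows collide, but it can never return less than the true count.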

      
§04

HyperLogLog

HyperLogLog answers a third question: how many distinct elements have I seen? Instead of storing the elements, it hashes them, uses some of the hash bits to pick one of many registers, and keeps, per register, the longest run of leading zeros observed in the remaining bits. A rare, long run of leading zeros is evidence that many distinct items must have passed through.

The main control is precision: more registers mean lower relative error, roughly 1.04 / sqrt(m) where m is the number of registers. That is why systems like Redis can offer approximate cardinality with a tiny, nearly fixed memory footprint.

HyperLogLog is one of the cleanest examples of "scale by changing the question." You stop asking for the set itself and ask only for the size of the set.
Precision versus error: the register count fixes the expected relative error; comparing the estimated distinct count against an exact count shows the actual relative error and the nearly fixed memory shape.
hyperloglog.js - approximate distinct counting with precision control
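A simplified sketch under stated assumptions: a 32-bit hash, the linear-counting correction for small cardinalities, and none of the sparse representation or bias tables that production implementations (such as Redis's) add. Class name and hash are illustrative:

```javascript
// Simplified HyperLogLog: p index bits select one of m = 2^p registers;
// each register keeps the longest leading-zero run seen in the other bits.
class HyperLogLog {
  constructor(precision = 12) {
    this.p = precision;       // register-index bits
    this.m = 1 << precision;  // number of registers
    this.registers = new Uint8Array(this.m);
  }

  // FNV-1a with a murmur3-style finalizer for well-mixed bits.
  hash(key) {
    let h = 2166136261;
    for (const ch of String(key)) h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
    h ^= h >>> 16;
    h = Math.imul(h, 0x85ebca6b) >>> 0;
    h ^= h >>> 13;
    h = Math.imul(h, 0xc2b2ae35) >>> 0;
    h ^= h >>> 16;
    return h >>> 0;
  }

  add(key) {
    const h = this.hash(key);
    const idx = h >>> (32 - this.p);   // top p bits pick the register
    const rest = (h << this.p) >>> 0;  // remaining 32 - p bits
    const rank = rest === 0 ? 32 - this.p + 1 : Math.clz32(rest) + 1;
    if (rank > this.registers[idx]) this.registers[idx] = rank; // keep the longest run
  }

  count() {
    const alpha = 0.7213 / (1 + 1.079 / this.m);
    let sum = 0;
    let zeros = 0;
    for (const r of this.registers) {
      sum += 2 ** -r;
      if (r === 0) zeros++;
    }
    let estimate = (alpha * this.m * this.m) / sum;
    // Linear-counting correction for small cardinalities.
    if (estimate <= 2.5 * this.m && zeros > 0) {
      estimate = this.m * Math.log(this.m / zeros);
    }
    return Math.round(estimate);
  }
}

// precision 12 -> 4,096 one-byte registers, expected error about 1.04/sqrt(4096) = 1.6%.
const hll = new HyperLogLog(12);
for (let i = 0; i < 10000; i++) hll.add(`user-${i}`);
```

Because registers only keep maxima, re-adding keys the structure has already seen leaves the estimate unchanged: duplicates are free, which is exactly what a distinct count requires.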

      
§05

The philosophy

There is a recurring engineering mistake: treating approximation as a moral failure rather than a workload decision. In real systems, exactness competes with memory, latency, bandwidth, and operational simplicity. If the tolerated error is explicit and the business consequence is acceptable, an approximate answer can be the higher-quality design.

That is the deeper lesson of these structures. They do not merely compress data. They compress decision-relevant information. A Bloom filter only needs to know enough to skip work most of the time. A count-min sketch only needs to know enough to rank frequent items. A HyperLogLog only needs to know enough to estimate cardinality.

Trading exactness for scale is not surrender. It is choosing the cheapest answer that still supports the product decision you actually need to make.