An interactive introduction to persistent homology
Every dataset has a shape. Not in the geometric sense — whether it's round or elongated or flat — but in a deeper, more fundamental sense. How many separate pieces does it have? Are there loops? Holes? Voids? These are topological questions, and they capture structure that ordinary statistics misses entirely.
Two datasets can have identical means, identical variances, even identical correlations — but completely different topologies. One might be a single cluster; the other, a ring. A correlation matrix can't tell them apart. A persistence diagram can.
Topological data analysis (TDA) asks: if we could see the shape that our data was sampled from, what would it look like? And it answers this question using one of the most beautiful tools in modern mathematics: persistent homology.
The central challenge is this: given a finite set of points, how do we recover the topology of the underlying shape? We can't see the shape directly — we only have samples. But there's a beautiful construction that bridges the gap.
Imagine growing a ball of radius ε/2 around each point. When two balls overlap, meaning the distance between their centers is at most ε, we connect them with an edge. When three balls overlap pairwise, we fill in a triangle: the construction only ever checks pairs of points. The resulting object is called the Vietoris-Rips complex at scale ε.
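To make the construction concrete, here is a minimal sketch in Python (the names `vietoris_rips` and `eps` are ours, not from any library): add an edge whenever two points are within ε, and fill in a triangle whenever all three of its edges are present.

```python
# A minimal sketch of the Vietoris-Rips construction (names are ours).
from itertools import combinations
from math import dist

def vietoris_rips(points, eps):
    """Vertices, edges, and triangles of the Rips complex at scale eps."""
    n = len(points)
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps]
    edge_set = set(edges)
    # A triangle is filled in as soon as all three of its edges are present.
    triangles = [(i, j, k) for i, j, k in combinations(range(n), 3)
                 if {(i, j), (i, k), (j, k)} <= edge_set]
    return list(range(n)), edges, triangles

# Four corners of a unit square: at eps = 1.1 the four sides appear but the
# diagonals (length ~1.414) do not, so the complex is a hollow square.
verts, edges, tris = vietoris_rips([(0, 0), (1, 0), (1, 1), (0, 1)], 1.1)
```

Growing ε past the diagonal length fills the square in: at ε = 1.5 all six edges and all four triangles are present, and the loop is gone.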
When ε is tiny, we see only isolated points. As ε grows, edges appear, then triangles. Components merge. Loops may form — and then get filled in. At very large ε, everything collapses into a single solid blob. The topology of the complex changes as ε varies, and the question becomes: which features are genuine properties of the data, and which are artifacts of our choice of scale?
Click to add points (up to 35). Drag the slider to grow the complex. Press play to animate.
The genius of persistent homology is to sidestep the impossible problem of choosing the "right" scale. Instead of committing to a single ε, we track every topological feature across all scales simultaneously.
Each feature has a birth time — the scale at which it first appears — and a death time — the scale at which it disappears. A connected component is born when its defining point enters the complex and dies when it merges with an older component. A loop is born when an edge completes a cycle and dies when triangles fill it in.
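The component bookkeeping described above can be sketched with a union-find structure: process edges in order of length, and each time an edge merges two components, record a death. A hypothetical helper (names are ours), using the fact that in the Rips filtration every point is born at scale 0:

```python
# A sketch of H0 persistence via union-find (names are ours). In the Rips
# filtration every point is born at scale 0; a component dies at the length
# of the edge that merges it into an older one.
from itertools import combinations
from math import dist, inf

def h0_barcode(points):
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Edges enter the filtration in order of their length.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components merge: one bar ends
            bars.append((0.0, eps))
            parent[max(ri, rj)] = min(ri, rj)
    bars.append((0.0, inf))          # one component survives forever
    return bars

# Two nearby points plus an outlier: a short bar, a long bar, one infinite bar.
bars = h0_barcode([(0, 0), (1, 0), (10, 0)])
```

The long bar is the signal: it says the data lived as two separate pieces across a wide range of scales before finally merging.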
The persistence barcode above summarises this: each feature is a horizontal bar stretching from birth to death. Long bars represent features that survive across a wide range of scales — these are robust signals about the shape of the data. Short bars represent features that flicker in and out — these are noise.
There's an equivalent way to visualise persistence: the persistence diagram. Each feature becomes a point with coordinates (birth, death). The diagonal line death = birth represents features with zero persistence — pure noise. The further a point sits from the diagonal, the more significant the feature.
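The translation from bars to diagram points is direct, and a feature's persistence (death minus birth) is exactly its vertical distance above the diagonal. A small sketch, with illustrative names:

```python
# A sketch of the bars-to-diagram translation (names are ours): each bar
# (birth, death) becomes a point, and its persistence death - birth is its
# vertical distance above the diagonal.
def diagram(bars):
    return [(b, d, d - b) for b, d in bars]

# Two long-lived features and one near-diagonal flicker of noise.
points = diagram([(0.0, 3.0), (0.2, 2.8), (1.1, 1.2)])
significant = [p for p in points if p[2] > 0.5]
```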
This representation has a remarkable mathematical property: the stability theorem guarantees that small perturbations of the data produce small changes in the persistence diagram. Add noise, remove a few points, perturb the coordinates — the diagram barely moves. This makes TDA fundamentally robust in a way that many other methods are not.
Let's put the theory to work. Below, you can choose from several preset shapes — a circle, a figure-eight, two clusters, an annulus — and see how persistent homology detects their topology. The persistence diagram reveals the signature: one prominent H₁ point for a circle, two for a figure-eight, none for random noise.
Why bother with topology when we have PCA, clustering, manifold learning, and all the other tools of modern data science? Because topology sees things they can't.
Topological features don't depend on the choice of coordinates. Rotate your data, translate it, reflect it: any rigid motion leaves the persistence diagram exactly unchanged, because the Rips construction depends only on pairwise distances. This is a far stronger guarantee than most dimensionality reduction methods offer.
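The reason is simple: the Rips complex is built from pairwise distances alone, so any distance-preserving transformation leaves it, and hence the persistence diagram, untouched. A quick illustrative check (function names are ours):

```python
# Illustrative check (names are ours): rigid motions preserve pairwise
# distances, hence the Rips complex and its persistence diagram.
from itertools import combinations
from math import dist, cos, sin

def pairwise(points):
    # Round away floating-point jitter introduced by the rotation.
    return [round(dist(p, q), 9) for p, q in combinations(points, 2)]

def rigid(points, theta, tx, ty):
    """Rotate by theta, then translate by (tx, ty)."""
    c, s = cos(theta), sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

pts = [(0, 0), (1, 0), (0, 2), (3, 1)]
moved = rigid(pts, 0.7, -4.0, 2.5)
assert pairwise(pts) == pairwise(moved)
```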
Most methods require you to choose a scale parameter: the number of clusters in k-means, the bandwidth in kernel density estimation, the number of neighbours in a KNN graph. Get it wrong and you miss features or hallucinate them. Persistent homology considers all scales at once, and tells you which features are robust.
The stability theorem guarantees that small perturbations of the input produce small perturbations of the persistence diagram. Short bars (noise) stay short; long bars (signal) stay long. This is a precise mathematical guarantee, not a heuristic.
Loops, voids, and higher-dimensional holes are invisible to means, variances, and even correlation matrices. A ring of points and a Gaussian cluster can have identical first and second moments. Persistent homology distinguishes them instantly — one has a persistent H₁ feature, the other doesn't.
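The moment-matching claim is easy to check numerically. Points uniform on a circle of radius r have mean 0 and per-coordinate variance r²/2, so an isotropic Gaussian cluster with that variance matches the first and second moments. A sketch (the sampling details are ours):

```python
# A numerical sketch (sampling choices are ours): points uniform on a circle
# of radius r have mean 0 and per-coordinate variance r**2 / 2, so a Gaussian
# cluster with that variance matches the first and second moments.
from math import cos, sin, pi, sqrt
import random

random.seed(0)
n, r = 2000, 2.0

ring = [(r * cos(2 * pi * k / n), r * sin(2 * pi * k / n)) for k in range(n)]
sigma = r / sqrt(2)  # matches the ring's per-coordinate variance r**2 / 2
blob = [(random.gauss(0, sigma), random.gauss(0, sigma)) for _ in range(n)]

def moments(pts):
    m = len(pts)
    return (sum(x for x, _ in pts) / m, sum(y for _, y in pts) / m,
            sum(x * x for x, _ in pts) / m, sum(y * y for _, y in pts) / m)

# Same means and second moments (up to sampling error) -- yet the ring has a
# persistent loop and the blob does not.
```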
TDA has found applications across an extraordinary range of fields. In neuroscience, persistent homology has detected toroidal topology in the firing patterns of grid cells in the brain — the neurons that encode spatial position form patterns with the topology of a torus, exactly as theoretical models predicted.
In materials science, the pore structure of amorphous materials (glasses, foams, granular media) has been characterised using persistence diagrams, revealing topological phase transitions invisible to traditional order parameters.
In cosmology, the large-scale structure of the universe — the cosmic web of filaments, sheets, and voids — has been analysed with persistent homology, providing topological characterisations of how matter is distributed across billions of light-years.
And perhaps the most mind-bending application: Gunnar Carlsson and colleagues showed that the space of high-contrast 3×3 patches from natural images has the topology of a Klein bottle — a non-orientable surface that can't exist in three dimensions without self-intersection. This wasn't assumed or modelled; it was discovered in the data by persistent homology.
For those who want a glimpse under the hood: persistent homology works by constructing the boundary matrix of the filtered simplicial complex and reducing it using column operations over the field ℤ/2ℤ (arithmetic modulo 2).
Each simplex σ gets a column in the matrix, with columns ordered by filtration value. The entries of a column are the indices of σ's boundary faces. The reduction algorithm processes columns left to right: as long as some column to the left shares the current column's lowest nonzero entry, we add that column to it (an XOR, since we're working mod 2).
After reduction, each nonzero column j with lowest entry low(j) yields a persistence pair (low(j), j): the simplex with index low(j) creates a feature at its filtration value (the birth), and simplex j destroys it (the death). Features that are never destroyed have infinite persistence; they represent the essential topology of the space.
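The whole reduction fits in a few lines. A sketch over ℤ/2 (the function name is ours), representing each column as the set of its nonzero row indices, with `low` as the maximum index in a column:

```python
def reduce_boundary(columns):
    """Standard persistence reduction over Z/2. columns[j] is the set of row
    indices of simplex j's boundary, simplices ordered by filtration value.
    Returns (i, j) pairs: simplex i creates a feature, simplex j kills it."""
    cols = [set(c) for c in columns]
    low_of = {}   # lowest (i.e. maximum) row index -> column that owns it
    pairs = []
    for j, col in enumerate(cols):
        while col and max(col) in low_of:
            col ^= cols[low_of[max(col)]]   # add the earlier column, mod 2
        if col:
            low_of[max(col)] = j
            pairs.append((max(col), j))
    return pairs

# A filtered solid triangle: vertices 0, 1, 2, then edges 3=(0,1), 4=(1,2),
# 5=(0,2), then the 2-simplex 6=(0,1,2). Vertices have empty boundary.
pairs = reduce_boundary([[], [], [], [0, 1], [1, 2], [0, 2], [3, 4, 5]])
# Pairs (1,3) and (2,4): the components created by vertices 1 and 2 die when
# edges join them to vertex 0. Pair (5,6): edge 5 completes a loop that the
# triangle then fills in. Vertex 0's component is never paired: it is the
# essential H0 feature with infinite persistence.
```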
The algorithm runs in O(m³) time in the worst case for m simplices, though in practice the sparse structure of the boundary matrix makes it much faster. For the interactive demos above, the computation runs in real time for up to 35 points (~5,000 simplices) entirely in your browser.
There is something almost philosophical about the persistence principle. In a world drowning in noise, the question "what is real?" is rarely easy to answer. Persistent homology offers a precise, mathematical answer: what persists across scales is real. Features that survive only at a single, specific resolution are artifacts. Features that endure — that can be seen no matter how you squint — are genuine properties of the underlying shape.
Data has shape. Not metaphorically — literally. The topology of a point cloud encodes information about loops, clusters, and voids that no amount of statistical summary can capture. Persistent homology gives us a principled, stable, and coordinate-free way to detect and quantify this shape. The beauty of the theory is in the principle: what persists is real.