What the landscape of error looks like — and why gradient descent works
Training a neural network is a search through a high-dimensional landscape. Every possible setting of its weights defines a point in a vast space. At each point, the network makes different predictions — and different errors. The loss function assigns each point a height: high for bad predictions, low for good ones. The result is a terrain of peaks, valleys, ridges, and saddle points.
We can't see a thousand-dimensional landscape. But we can take cross-sections — two-dimensional slices through the terrain — and these slices reveal something remarkable about why neural networks learn at all.
This page lets you train a small neural network on a classification problem, then see the landscape it navigated to find its solution.
Start simple. We have points in two dimensions, each belonging to one of two classes. The task: find a decision boundary that separates them.
A straight line can separate some datasets. But for concentric circles or interleaving spirals, no straight line works. A neural network learns to draw arbitrarily complex boundaries — curves, loops, disconnected regions — by composing simple nonlinear transformations.
A single hidden layer transforms the input in two steps. First, the hidden neurons apply weighted sums followed by a nonlinear activation function. This remaps the input space — bending, folding, and stretching it — so that the two classes become linearly separable. Then the output neuron draws a boundary in this remapped space.
The key: the "right" remapping depends on the weights, and the weights are what we need to find. A network with 2 inputs, 8 hidden neurons, and 1 output has 33 adjustable parameters. Each of those 33 numbers draws a different boundary. Most boundaries are terrible. Training means finding the good ones.
Think of those 33 numbers as coordinates in a 33-dimensional space. Every point is a specific neural network. At each point, we measure the loss: how wrong the predictions are. This gives us a surface — a landscape where altitude is error.
Training means finding a valley. But we can't see a 33-dimensional landscape. We can only feel the slope under our feet — the gradient — and walk downhill. This is gradient descent: compute the gradient, go the other way, repeat.
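The whole procedure fits in a few lines. A generic sketch, assuming a gradient function `grad(theta)` over the flattened weight vector (in practice it comes from backpropagation):

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=1000):
    """Feel the slope, step the other way, repeat."""
    theta = theta0.copy()
    path = [theta.copy()]           # keep the trajectory for plotting later
    for _ in range(steps):
        theta -= lr * grad(theta)   # move against the gradient
        path.append(theta.copy())
    return theta, np.array(path)
```

The recorded path is exactly the trajectory whose cross-section we look at next.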
After training, we can take a two-dimensional slice through this landscape and see the terrain the optimizer traversed. The first axis follows the direction from initial to final weights — the learning trajectory. The second is perpendicular. The result is a contour map of the error surface.
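One way to build such a slice (a sketch; the grid span and resolution are arbitrary choices): take the unit vector from initial to final weights as the first axis, a random vector orthogonalized against it as the second, and evaluate the loss on a grid in that plane.

```python
import numpy as np

def landscape_slice(loss, theta_init, theta_final, span=1.5, res=50, seed=0):
    """Evaluate the loss on a 2-D plane through weight space.

    Axis u follows the training trajectory (initial -> final weights);
    axis v is a random direction made perpendicular to u.
    """
    rng = np.random.default_rng(seed)
    u = theta_final - theta_init
    u = u / np.linalg.norm(u)
    v = rng.normal(size=theta_init.shape)
    v -= (v @ u) * u                   # remove the component along u
    v = v / np.linalg.norm(v)

    alphas = np.linspace(-span, span, res)
    betas = np.linspace(-span, span, res)
    grid = np.empty((res, res))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = loss(theta_final + a * u + b * v)
    return alphas, betas, grid         # ready for a contour plot
```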
Choose a dataset, configure the network, and click Train. Watch the decision boundary evolve as the network learns. Then click Landscape to reveal the terrain.
The left panel shows what the network learned — how it partitions the input space. Blue regions are predicted class 0; orange regions are class 1. The transition zone is the decision boundary.
The right panel shows how it got there — a cross-section of the loss landscape. The color scale runs from deep blue (low loss) through teal and gold to red (high loss). The white line traces the optimizer's path from its random starting point (red dot) to the final solution (green dot).
Look for:
The valley. The trained weights sit at a low point. Is the valley broad and gentle, or narrow and steep? Broad valleys correspond to robust solutions — small perturbations to the weights don't ruin performance.
The path. Did the optimizer follow a smooth descent, or zigzag? Zigzagging usually means the learning rate is too high for the local curvature.
The terrain. Is the landscape smooth or fractured? Smooth landscapes are easier to optimize. Fractured landscapes trap the optimizer in local minima.
A network with 4 hidden neurons has 17 parameters. One with 32 has 129. More parameters means more room to manoeuvre — and smoother landscapes. This is the core insight behind the observation that overparameterised networks train more easily.
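The count follows a simple formula for a 2-input, 1-output network with h hidden neurons: 2h weights and h biases into the hidden layer, then h weights and 1 bias into the output, for 4h + 1 in total.

```python
def param_count(hidden):
    """Parameters in a fully connected 2 -> hidden -> 1 network."""
    return (2 * hidden + hidden) + (hidden + 1)   # = 4 * hidden + 1

print(param_count(4), param_count(8), param_count(32))   # 17 33 129
```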
ReLU creates piecewise-linear decision boundaries: sharp corners, flat regions. Its landscape has plateaus — "dead zones" where neurons output zero and gradients vanish — connected by abrupt transitions. Sigmoid creates smooth, S-shaped responses, and its landscape is smoother too, but gradients are smaller everywhere, making learning slower. Tanh is a compromise: centered at zero with stronger gradients than sigmoid.
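The three activations and their gradients, written out for reference (a sketch; the page may use different scalings):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # zero for x < 0: the "dead zone"

def relu_grad(x):
    return (x > 0).astype(float)       # gradient vanishes entirely for x < 0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # smooth S-curve between 0 and 1

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)               # never exceeds 0.25: small gradients everywhere

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2       # zero-centered, gradient up to 1.0
```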
The learning rate doesn't change the landscape — it changes how you navigate it. Too large and you overshoot valleys, bouncing between walls. Too small and you barely move, possibly stopping at a mediocre point. The right learning rate is the largest one that doesn't cause oscillation.
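The overshoot threshold is easy to see on a one-dimensional valley. For a toy loss of 0.5 * k * w**2 (a stand-in for the local curvature of a real valley), the update multiplies w by (1 - lr * k) each step, so it converges only when lr < 2/k:

```python
def descend(k=4.0, lr=0.1, w=1.0, steps=10):
    """Gradient descent on loss = 0.5 * k * w**2; the gradient is k * w."""
    for _ in range(steps):
        w -= lr * k * w        # each step multiplies w by (1 - lr * k)
    return w

print(descend(lr=0.10))   # smooth descent toward 0
print(descend(lr=0.45))   # bounces between the valley walls but still shrinks
print(descend(lr=0.60))   # lr > 2/k = 0.5: each bounce overshoots further, diverges
```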
Why does gradient descent work at all? The landscape of even a 33-parameter network has an astronomical number of critical points. Finding the global minimum seems hopeless.
But here's the thing: in high dimensions, true local minima are exponentially rare. A critical point is a local minimum only if the curvature is positive in every direction. In 33 dimensions, that requires 33 independent curvatures to all be positive. If each has a roughly 50% chance of being positive, the probability of a true local minimum is 2⁻³³ — about one in eight billion.
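The arithmetic, under that 50%-per-direction assumption:

```python
p = 0.5 ** 33          # probability that all 33 curvatures are positive
print(p)               # ~1.16e-10
print(round(1 / p))    # 8589934592: roughly one in eight billion
```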
Most critical points are saddle points: minima in some directions, maxima in others. And gradient descent naturally escapes saddle points, because the negative-curvature directions provide a downhill path. The noise in stochastic gradient descent — using random subsets of data — gives the optimizer enough jitter to find these escape routes.
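Stochastic gradient descent differs from the plain loop above only in where the gradient comes from: each step averages per-example gradients over a random minibatch, and that noisy estimate supplies the jitter. A sketch, assuming a per-example gradient function `grad_i(theta, i)`:

```python
import numpy as np

def sgd(grad_i, theta0, n_examples, lr=0.1, batch=32, steps=1000, seed=0):
    """Each step uses a random subset of the data, so the descent direction jitters."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(steps):
        idx = rng.choice(n_examples, size=batch, replace=False)
        g = np.mean([grad_i(theta, i) for i in idx], axis=0)  # noisy gradient estimate
        theta -= lr * g
    return theta
```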
The deep insight is this: the geometry of the loss landscape, not the sophistication of the optimizer, is the reason neural networks work. The landscapes of overparameterised networks are benign — their local minima, when they exist, tend to have similar loss values, and their saddle points offer downhill escape routes. Gradient descent, despite being the simplest possible strategy (walk downhill), is navigating terrain whose geometry is fundamentally on its side.
Every neural network you've ever used — language models, image classifiers, protein folders — found its solution by walking downhill through a landscape like this one. The terrain is incomprehensibly vast. But the geometry is kind.