Neural Networks: Backpropagation and the Long Climb Back

How the chain rule solved credit assignment, why depth still failed, and what kept the field alive through a second winter.


The hidden layer insight: if one layer can only draw straight lines, multiple layers can warp the space itself.


The Ideas in Conflict

The AI Winter didn’t freeze everyone equally. By the early 1980s, expert systems were the dominant paradigm — commercially deployed, DARPA-funded, institutionally respectable. The approach: hire a domain expert, interview them exhaustively, encode their knowledge as explicit IF-THEN rules, run an inference engine over them. XCON, deployed at Digital Equipment Corporation in 1982, was saving the company $40M a year.1

The competing ideas were:

  • Connectionism — the underground view that intelligence emerges from learning, not from hand-crafted rules. The PDP (Parallel Distributed Processing) group at UCSD — Rumelhart, McClelland, Hinton — were its intellectual centre. Their 1986 two-volume manifesto Parallel Distributed Processing was a declaration of war. (David Rumelhart, first author of the 1986 backprop paper, developed Pick’s disease (frontotemporal dementia) in the mid-1990s and retired in 1998, unable to work. He died in 2011, having been unable to speak or write for his last decade. The Rumelhart Prize — the field’s highest honour, named after him — was first awarded in 2001, to Hinton; Rumelhart was already incapacitated, and he died one year before AlexNet. Hinton himself moved to Toronto in 1987 partly on principle: he refused DARPA funding with defence applications. He stayed 25 years. His students — LeCun, Sutskever, Krizhevsky — built the modern field. The principled retreat that looked like obscurity was the founding of an intellectual dynasty.)
  • Statistics — a separate discipline entirely. Statisticians had logistic regression, linear discriminant analysis, and maximum likelihood. They considered neural networks unprincipled black boxes with no theoretical guarantees.
  • Control theory — Widrow and Hoff’s ADALINE (1960) had used gradient descent on single-layer networks. The mathematics was there; nobody had extended it to multiple layers.

The question that separated these camps: does intelligence come from knowledge representation, or from learning? Expert systems said: encode what you know. Connectionists said: learn it from data.

The expert systems camp had funding, prestige, and commercial success. The connectionists had a better idea.


1986 — Backpropagation: The Chain Rule as Credit Assignment

The problem Minsky left open in 1969 was credit assignment: given an error at the output, how do you apportion blame to hidden units deep in the network? A hidden unit’s error isn’t directly observable — you only see the final output. How does the signal get back?

The Ideas Rosenblatt Was Working Against (Redux)

The block wasn’t mathematical difficulty. The chain rule is centuries old. The block was conceptual:

  • Nobody believed multi-layer networks were worth training. Minsky’s verdict had poisoned the well.
  • Credit assignment seemed to require global coordination — some central controller surveying the whole network, deciding who was responsible. Local units computing their own blame seemed circular.

The insight that broke the impasse: the architecture is a composition of functions. Each layer applies a function to the previous layer’s output. The derivative of a composition is — by the chain rule — a product of local derivatives. Each unit only needs to know its own local gradient and the gradient flowing in from the layer above.

\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial w}\]

No global coordination required. The forward pass computes activations layer by layer; the backward pass computes gradients layer by layer, in reverse. The error flows backward through the same structure the activations flowed forward through.
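
To make the decomposition concrete, here is a minimal numeric sketch (a one-hidden-unit network with a sigmoid activation and squared-error loss; the names and values are illustrative, not from the 1986 paper): each factor in the product above is computed locally, and their product matches a direct finite-difference check on the weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: h = sigmoid(w1 * x), y_hat = w2 * h, L = 0.5 * (y_hat - y)^2
x, y = 1.5, 0.3          # one training example
w1, w2 = 0.8, -0.4       # weights

# Forward pass: activations, layer by layer.
h = sigmoid(w1 * x)
y_hat = w2 * h
L = 0.5 * (y_hat - y) ** 2

# Backward pass: the chain rule, one local derivative per factor.
dL_dyhat = y_hat - y               # ∂L/∂ŷ
dyhat_dh = w2                      # ∂ŷ/∂h
dh_dw1 = h * (1 - h) * x           # ∂h/∂w1 (sigmoid derivative times input)
dL_dw1 = dL_dyhat * dyhat_dh * dh_dw1

# Sanity check: perturb w1 directly and measure the change in the loss.
eps = 1e-6
L_perturbed = 0.5 * (w2 * sigmoid((w1 + eps) * x) - y) ** 2
print(dL_dw1, (L_perturbed - L) / eps)   # the two estimates agree closely
```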

This wasn’t a trick. It was the only tool that correctly decomposes the gradient given the compositional structure. The architecture already contained the solution. The same structure that made credit assignment seem impossible made it uniquely tractable.2 (What led Werbos to backprop in 1974 was Freud, not control theory — he was trying to mathematise Freud’s idea of backward credit flow through a psychic system. His Harvard committee called it “crazy, megalomaniac, nutzoid.” He survived by secretly embedding the algorithm inside a political science thesis, eating soybean soup in a Roxbury slum to conserve money. He couldn’t publish it cleanly for eight more years. Sources: thesis, 1993 interview.)

The Computational Graph

Backprop is most clearly understood on a computational graph — a directed acyclic graph where each node is an operation and edges carry values (forward) or gradients (backward).

A computational graph for a two-layer network. Use the buttons to step through: forward pass (green) computes values left to right; backward pass (red) computes gradients right to left. Each node multiplies incoming gradient by its local derivative.
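
The same machinery as a data structure, in a deliberately minimal sketch (a scalar-valued node class invented here for illustration, not any particular library's API): each node stores its value, its parents, and a small closure that pushes its incoming gradient to those parents; calling backward() visits nodes in reverse topological order.

```python
class Node:
    """A scalar value in a computational graph, with reverse-mode autodiff."""
    def __init__(self, value, parents=(), backward_fn=lambda: None):
        self.value = value
        self.grad = 0.0
        self.parents = parents
        self._backward = backward_fn

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other))
        def backward():
            # Local derivative of a product: scale the incoming gradient by the other factor.
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = backward
        return out

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other))
        def backward():
            self.grad += out.grad    # derivative of a sum is 1 on both inputs
            other.grad += out.grad
        out._backward = backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order, applying each local rule once.
        order, seen = [], set()
        def topo(node):
            if node not in seen:
                seen.add(node)
                for p in node.parents:
                    topo(p)
                order.append(node)
        topo(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# y = (a * b) + a  →  dy/da = b + 1, dy/db = a
a, b = Node(2.0), Node(3.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)   # 4.0 2.0
```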

Gradient Descent on a Loss Surface

Training is finding the lowest valley in a loss landscape. Gradient descent says: measure the local slope, take a small step downhill, repeat.

Gradient descent on a loss surface. Drag to change learning rate α and momentum β. High α → overshooting. Momentum → faster convergence through ravines.
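
A minimal sketch of the update rule (the elongated quadratic bowl and the particular α and β values are purely illustrative): plain gradient descent steps along the negative gradient; momentum keeps a running velocity so progress through a narrow ravine does not stall in oscillation.

```python
import numpy as np

def loss(w):
    # An elongated bowl: steep along w[0], shallow along w[1] (a "ravine").
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w):
    return np.array([10.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
velocity = np.zeros(2)
alpha, beta = 0.05, 0.9          # learning rate and momentum

for _ in range(200):
    velocity = beta * velocity - alpha * grad(w)   # accumulate a running direction
    w = w + velocity                               # step downhill
print(w, loss(w))                # both coordinates approach the minimum at the origin
```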

Non-Linear Activations: Why Stacked Linear Layers Are Useless

If you stack linear layers: $W_2(W_1 x) = (W_2 W_1)x$ — still one big matrix. No matter how many layers, you are still cutting space with a single straight line. Depth without non-linearity collapses.
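
A quick sketch of that collapse (random matrices, purely illustrative): applying two linear layers in sequence is numerically identical to applying their single product matrix, and inserting a non-linearity between them is what breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1 = rng.standard_normal((8, 4))   # first "layer"
W2 = rng.standard_normal((3, 8))   # second "layer"

two_layers = W2 @ (W1 @ x)         # stacked linear layers
one_matrix = (W2 @ W1) @ x         # a single equivalent matrix
print(np.allclose(two_layers, one_matrix))   # True: depth collapsed

# Insert a non-linearity and the collapse disappears.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(W2 @ relu(W1 @ x), one_matrix))   # False (in general)
```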

Non-linearity introduces kinks that let the network learn curved, complex boundaries. Each activation breaks the linearity at that layer, so subsequent layers compose genuinely different transformations.

Three activation functions (solid) with their derivatives (dashed). The derivative controls how strongly gradient signal passes backward through that unit. Click to switch.

The competing ideas about which non-linearity to use:

  • Sigmoid (1986) — biologically motivated, smooth, bounded. Problem: derivative peaks at 0.25. This mattered enormously once networks went deep.
  • Tanh (zero-centred sigmoid) — a small improvement: its derivative peaks at 1 rather than 0.25, but it still saturates, so gradients through deep stacks still shrink.
  • ReLU (Rectified Linear Unit, $\max(0, x)$) — known in neuroscience for decades (biological neurons are roughly ReLU-shaped). Nobody thought to use it in artificial networks until Glorot & Bengio (2010) analysed why sigmoid was failing, and Nair & Hinton (2010) showed ReLU worked better in practice.3

The Vanishing Gradient: Why Depth Still Failed

Backprop solved multi-layer training. It did not solve deep networks. The field discovered this the hard way through the late 1980s and 1990s.

The Problem

Sigmoid’s derivative peaks at 0.25. Chain-multiply through 10 layers:

\[0.25^{10} \approx 10^{-6}\]

Early layers receive nearly zero gradient. They don’t learn. The network is theoretically deep but effectively shallow — only the last few layers are actually training. The further back you go, the less information arrives. (Hochreiter proved this rigorously in his 1991 diploma thesis at TU Munich: for sigmoid activations, the gradient-scaling product along any path is provably less than 1, decaying exponentially with depth. Increasing weights makes it worse — as weights grow, f’ shrinks toward zero. The thesis was written in German, by an unknown student, and received almost no attention.)
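
A back-of-the-envelope sketch of that compounding (one scalar path per layer, best case for sigmoid, purely illustrative): even with every unit sitting at its steepest point, ten sigmoid layers scale the gradient by about $10^{-6}$, while an active ReLU path scales it by exactly 1.

```python
import numpy as np

def sigmoid_deriv(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)               # peaks at 0.25 when z = 0

depth = 10

# Best case for sigmoid: every unit sits exactly at its steepest point (z = 0).
sigmoid_scale = sigmoid_deriv(np.zeros(depth)).prod()   # 0.25 ** 10
# ReLU on an active path: the local derivative is exactly 1 at every layer.
relu_scale = np.ones(depth).prod()

print(f"sigmoid, {depth} layers (best case): {sigmoid_scale:.1e}")   # ~9.5e-07
print(f"relu,    {depth} layers (active):    {relu_scale:.0f}")      # 1
```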

Gradient magnitude per layer, sigmoid vs ReLU. With sigmoid, gradients shrink exponentially toward the input. With ReLU, gradient magnitude stays roughly constant — derivative is 1 for positive inputs, so no compounding decay.

The Fixes

The fixes arrived over two decades from three separate communities:

Era | Problem | Fix | Why it works
2010 | Shrinking gradients | ReLU | Derivative = 1 for positive inputs — no compounding decay
2015 | Internal covariate shift | Batch Normalisation | Re-centres and rescales activations at each layer; gradient flow stabilised
2015 | Deep gradient death | Residual connections4 | Skip-connections route gradient directly to earlier layers, bypassing the chain entirely

None of these arrived together. ReLU required recognising sigmoid as the culprit. BatchNorm came from a Google team noticing training became unstable as depth increased. ResNets arose from an anomaly: adding more layers was making performance worse, not better — which shouldn’t happen if the extra layers could simply learn identity mappings.
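
A sketch of why the skip connection rescues the gradient (linear layers only, small random weights, not the He et al. implementation): a residual block computes $F(x) + x$, so its Jacobian is $I + \partial F/\partial x$; multiplying Jacobians through many layers, the identity terms keep the backward signal from collapsing the way a chain of plain layers does.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 16, 50
Ws = [0.1 * rng.standard_normal((dim, dim)) for _ in range(depth)]

# Backward-pass scale = product of per-layer Jacobians (activations omitted for clarity).
J_plain = np.eye(dim)
J_residual = np.eye(dim)
for W in Ws:
    J_plain = J_plain @ W                          # plain layer: Jacobian is W
    J_residual = J_residual @ (np.eye(dim) + W)    # residual block: Jacobian is I + W

print(np.linalg.norm(J_plain))      # ~0: gradient vanishes through 50 plain layers
print(np.linalg.norm(J_residual))   # order 1 or larger: the identity path survives
```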

The solution to vanishing gradients for recurrent networks came earlier and from a different direction. Hochreiter and Schmidhuber’s LSTM (1997) introduced the constant error carousel — a memory cell with an identity self-connection (weight 1.0), so the gradient neither grows nor shrinks as it circulates through time. (Hochreiter named it the “constant error carousel” because error signals could circulate indefinitely without decay. Gating units decide what enters and leaves the cell.) The 1997 paper received ~15 citations/year for five years. It now has ~98,000 — one of the most cited papers in ML history.
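
A sketch of the carousel in code (a stripped-down cell update with hand-set gate values, not the full 1997 equations): with the self-connection fixed at 1, the gradient flowing back along the cell state is multiplied by exactly 1 at every time step, whereas an ordinary sigmoid recurrence multiplies in a factor of at most 0.25 per step.

```python
import numpy as np

T = 100                                   # time steps to backpropagate through
rng = np.random.default_rng(0)
gated_inputs = 0.5 * rng.standard_normal(T)

# Forward: the cell accumulates gated input over an identity self-connection.
c = 0.0
for g in gated_inputs:
    c = 1.0 * c + g                       # c_t = 1·c_{t-1} + gated input

# Backward along the cell state: each step multiplies the gradient by the
# self-connection weight (1.0) — the constant error carousel.
grad_c = 1.0
for _ in range(T):
    grad_c *= 1.0
print(grad_c)                             # 1.0 — no decay across 100 steps

# An ordinary recurrent unit whose local derivative is at most 0.25:
print(0.25 ** T)                          # ~6e-61 — effectively zero
```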


Universal Approximation

A neural network with one hidden layer and enough neurons, using a non-linear activation, can approximate any continuous function to arbitrary precision. (Proved independently by Cybenko (1989) for sigmoid activations and Hornik (1991) for general activations.)

The intuition is constructive: each neuron carves out a “bump” in the input space. Stack enough bumps and you can approximate any shape.
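
A sketch of the bump construction (a hand-picked target function and bump widths, illustrating the intuition rather than either proof): a pair of shifted steep sigmoids makes one bump, each bump is scaled to the target’s height at its centre, and the maximum error falls as bumps are added.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

target = lambda x: np.sin(2 * np.pi * x)        # the function to approximate
xs = np.linspace(0.0, 1.0, 1000)

def bump_approximation(n_bumps):
    sharpness = 40.0 * n_bumps                  # steeper sigmoids as bumps narrow
    edges = np.linspace(0.0, 1.0, n_bumps + 1)
    approx = np.zeros_like(xs)
    for left, right in zip(edges[:-1], edges[1:]):
        height = target((left + right) / 2)
        # One bump: a sigmoid switching on at `left` minus one switching on at `right`.
        bump = sigmoid(sharpness * (xs - left)) - sigmoid(sharpness * (xs - right))
        approx += height * bump
    return approx

for n in (5, 20, 80):
    err = np.max(np.abs(bump_approximation(n) - target(xs)))
    print(f"{n:3d} bumps → max error {err:.3f}")   # error shrinks as bumps are added
```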

Universal approximation in action. Each neuron contributes a scaled, shifted sigmoid bump. Their sum approximates the target function. More neurons = finer approximation.

But notice what Universal Approximation does not guarantee:

  • How large the network needs to be (could be exponentially large)
  • That you can learn the right weights (existence ≠ findability)
  • That the network will generalise to new data

These three gaps — efficiency, learnability, generalisation — define the open problems of the next 30 years.


1989–1998 — LeCun and CNNs

The vanishing gradient problem stalled deep fully-connected networks. But Yann LeCun found a different path: exploit the structure of the input domain.

The competing approaches to vision in the late 1980s:

  • Hand-crafted features — computer vision researchers spent years designing features based on their understanding of image structure: edge detectors, and later HOG descriptors and SIFT. These features were engineered and then fed to a classifier. Learning was not part of the pipeline.

  • Fully-connected networks — treat the image as an unstructured vector of pixels, with no notion of which positions are neighbours. An FC network must learn that a cat’s ear looks the same whether it appears top-left or centre-right, redundantly, for every position.

LeCun’s insight: natural images have spatial structure. Patterns that matter (edges, textures, shapes) appear at multiple locations. The same detector should work everywhere — share weights across positions.

The Insight

A convolution is a filter that scans across the input, applying the same small set of weights at every position. One filter learns one kind of pattern (an edge, a curve, a texture) and finds it wherever it appears.

This buys two things:

  1. Translation equivariance — if the pattern moves, the activation moves with it. The network doesn’t have to re-learn “cat ear in top-left” separately from “cat ear in centre.”
  2. Parameter efficiency — instead of a weight connecting every input pixel to every hidden unit ($784 \times 500 = 392{,}000$ parameters for a tiny MNIST network), a $5 \times 5$ filter has 25 parameters and applies them everywhere.
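
A sketch of the sliding filter (a single hand-set 3×3 edge detector, stride 1, no padding; not LeNet’s layer): nine shared weights are applied at every position, and the feature map responds wherever the vertical edge appears, whichever position that is.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one small kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)   # same weights at every position
    return out

# A vertical-edge detector: 9 shared parameters, applied everywhere.
edge_filter = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])

image = np.zeros((8, 8))
image[:, 4:] = 1.0                     # bright right half → one vertical edge
feature_map = conv2d(image, edge_filter)
print(feature_map)                     # strong response along the edge columns
```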

Stack convolution layers and you get a hierarchy: the first layer learns edges, the second learns textures built from edges, the third learns shapes built from textures. Each layer abstracts over the layer before. (LeCun arrived at Bell Labs in 1988 after a postdoc under Hinton in Toronto. His account of weight-sharing is undramatic: “the logical thing to do.” The 1989 paper explicitly cites Hubel and Wiesel’s 1962 discovery of oriented edge detectors in cat visual cortex. Those cells were hand-wired by evolution; LeCun’s could be learned end-to-end with backprop. That combination had not been made before.)

A convolutional layer. The filter slides across the input, computing a dot product at each position. The output (feature map) shows where the pattern the filter learned appears in the input. Multiple filters in the same layer detect multiple patterns in parallel.

What LeNet Actually Did

LeCun’s LeNet (1989, refined to LeNet-5 in 1998) was not a research curiosity. It was deployed in real systems: by the late 1990s, LeNet was reading handwritten cheques at ATMs across the United States, processing an estimated 10–20% of all cheques written in the country.5 Deep learning was running production systems while the academic field was still debating whether neural networks worked at all.

The architecture:

Input (32×32) → Conv → Pool → Conv → Pool → FC → FC → Output (10 classes)

Each conv-pool block reduces spatial size while increasing depth. The FC layers at the end combine learned features to produce a classification.
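
A rough shape walkthrough of that pipeline (filter counts and sizes chosen to echo LeNet-5, but simplified; strides, padding, and the exact published details are glossed over): each conv-pool stage shrinks the spatial grid while the channel count grows, and the flattened result feeds the fully-connected layers.

```python
# Shapes through a LeNet-style stack as (channels, height, width); illustrative sizes.
shape = (1, 32, 32)                     # input: one 32×32 greyscale image

def conv(shape, out_channels, k=5):     # 5×5 filters, stride 1, no padding
    c, h, w = shape
    return (out_channels, h - k + 1, w - k + 1)

def pool(shape, k=2):                   # 2×2 pooling halves height and width
    c, h, w = shape
    return (c, h // k, w // k)

shape = conv(shape, 6);  print("conv1:", shape)   # (6, 28, 28)
shape = pool(shape);     print("pool1:", shape)   # (6, 14, 14)
shape = conv(shape, 16); print("conv2:", shape)   # (16, 10, 10)
shape = pool(shape);     print("pool2:", shape)   # (16, 5, 5)
flat = shape[0] * shape[1] * shape[2]
print("flatten:", flat, "→ FC → FC → 10 classes")  # 400 features into the FC layers
```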

What It Left Open

CNNs worked because images have spatial structure. The same principle — exploit domain structure through architectural inductive bias — would take decades to apply elsewhere:

  • Text has sequential structure → RNNs, then Transformers
  • Graphs have relational structure → GNNs
  • Physics has symmetry structure → equivariant networks

The CNN insight is really a meta-insight: the right inductive bias, baked into the architecture, can substitute for enormous amounts of data and compute. This remained the dominant design principle until scale made it optional.


What This Era Left Open

The backprop era solved the training problem for shallow networks and, with CNNs, for structured domains. What remained:

  • Scale: LeNet worked on small images (32×32). Real-world images are orders of magnitude larger. Training on them required compute that didn’t exist yet.
  • Data: Supervised learning requires labels. Getting millions of labelled examples requires infrastructure that also didn’t exist yet.
  • Depth for unstructured domains: Fully-connected deep networks still failed on raw pixels at scale. The vanishing gradient fixes (ReLU, BatchNorm, ResNets) arrived in 2010–2015 — not yet.
  • Generalisation theory: Why do these networks generalise to unseen data? Especially: why do overparameterised networks generalise? This question remains genuinely open.

The door was briefly reopened in 2006 by Hinton’s deep belief network paper, which showed that greedy layer-wise pretraining could give gradient descent a starting point good enough to train deep networks. (Hinton’s insight was complementary priors: if the prior over a hidden layer matches the posterior the layer below computes, the “explaining away” problem cancels exactly. This made greedy layer-wise training of RBMs theoretically justified. Pretraining turned out to be unnecessary once ReLU arrived — but it bought six years of renewed research and ended the second neural network winter.) The compute and data problems were solved together in 2012 — the story of AlexNet, ImageNet, and the GPU revolution. (After AlexNet’s 2012 victory — 15.3% top-5 error vs second place’s 26.2% — Hinton and students Krizhevsky and Sutskever formed DNNresearch and ran an auction. Hinton’s back condition prevents flying; bidding happened by laptop between conference sessions in Lake Tahoe. Google, Microsoft, DeepMind, and Baidu bid. Google won for ~$44M. Hinton had spent 25 years working against the current.) That is the next era.

Continued in Neural Networks: The Deep Learning Revolution


  1. Expert systems peaked commercially in the early 1980s. By 1987 the market had collapsed: the hardware (Lisp machines) was being undercut by general-purpose workstations, and the fundamental limitations of encoding knowledge manually had become apparent. Companies like Symbolics, IntelliCorp, and Teknowledge shed most of their value. The expert system winter hit just as backpropagation was published — two failures clearing the ground for something new. 

  2. Backpropagation was not invented by Rumelhart, Hinton, and Williams — it was rediscovered by them. Paul Werbos derived it in his 1974 Harvard PhD thesis. David Parker rediscovered it in 1982. Yann LeCun independently derived it in 1985. Rumelhart’s 1986 Nature paper received the credit largely because of timing and institutional prominence. The idea that gets remembered is not always the first one, but the one that lands in fertile ground. 

  3. ReLU-like activations were known in neuroscience modelling since the 1960s. Glorot and Bengio’s 2010 paper showed why sigmoid was failing; Nair and Hinton showed ReLU worked better empirically. Krizhevsky used it in AlexNet (2012). An activation function matching the biology was ignored by ML for 50 years. 

  4. He et al. observed that a 56-layer network performed worse than a 20-layer one on training data — impossible if extra layers could learn identity mappings. The residual connection $F(x) + x$ reframes the problem: learn the residual from identity rather than the full transformation. Near-zero is easier to learn than the input itself, and the skip connection gives gradients a direct highway to early layers. 

  5. AT&T Bell Labs estimates LeNet or its descendants were processing 10–20% of all cheques written in the United States by the late 1990s. Deep learning was running critical financial infrastructure while the research community had moved on to SVMs. Practical success in deployment did not translate into academic interest — academic interest tracks theory and benchmarks, not production systems.