The Birth of Neural Networks

There are artifacts you can build out of sand that convert energy into intelligence. How is that possible? Where does the intelligence come from?


The Idea: mental phenomena are emergent from interconnected networks of simple, often uniform units: for example, neurons connected by synapses, as in the brain.


What Came Before

René Descartes had separated the mind from the body in his mind-body dualism: mind and body are made of different stuff. Dualism made the question of consciousness officially out of bounds for empirical investigation; the mind was simply not the kind of thing that science was supposed to explain. By the early 20th century, this dualism had produced two competing camps: the mechanists, who believed living systems obeyed the same physical laws as everything else, and the vitalists, who held that life required some irreducible non-physical principle. The mind remained the last stronghold of the vitalists.

In the early 20th century, Gottlob Frege and Bertrand Russell sought to derive all of mathematics from pure logic. Boole had shown how logic could be expressed through algebraic formulas (Boolean algebra); Frege and Russell extended this to quantifiers (“for all,” “there exists”) and complex propositional logic, providing a more expressive formal language. They succeeded in deriving basic arithmetic, but in 1931 Gödel proved that in any logical system complex enough to do arithmetic, there will always be true statements that cannot be proven using the rules of that system.

In 1937, Shannon wrote what is often called the most important master’s thesis in history (“A Symbolic Analysis of Relay and Switching Circuits”). He showed that the “abstract” logic of Boole and Russell could be mapped onto physical switches: if a switch is open, it’s a 0; if a switch is closed, it’s a 1. Put two switches in a row (series) and you have an AND gate; put them side by side (parallel) and you have an OR gate. He went on to prove that arrangements of relays could also be used to solve any problem Boolean algebra could state. Shannon’s thesis became the foundation of practical digital circuit design: using the binary properties of electrical switches to perform logic functions is the basic concept underlying all electronic digital computer designs.
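A minimal sketch of that mapping in code (the function names are mine, purely illustrative): a closed switch is a 1, two switches in series compute AND, two in parallel compute OR.

```python
# Illustrative sketch: relay circuits as Boolean logic (names are hypothetical).
def series(switch_a: bool, switch_b: bool) -> bool:
    """Two switches in a row: current flows only if both are closed -> AND."""
    return switch_a and switch_b

def parallel(switch_a: bool, switch_b: bool) -> bool:
    """Two switches side by side: current flows if either is closed -> OR."""
    return switch_a or switch_b

# Open = 0/False, closed = 1/True.
assert series(True, True) is True       # AND gate
assert series(True, False) is False
assert parallel(False, True) is True    # OR gate
assert parallel(False, False) is False
```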

The neuron doctrine: The dominant biological view, the reticular theory1, held that the nervous system was one continuous net, a single fused web with no internal structure. Against this, Cajal argued the brain was made of discrete individual cells, each separated by a gap. Discrete cells meant a circuit, not a fog. It meant the brain had structure that could in principle be described.

The Pioneer | The “Substrate” (The Stuff) | The Insight | The Result
Claude Shannon | Electrical Relays | Switches are just Boolean Logic in physical form. | Digital Computers
Walter Pitts | Biological Neurons | Nerve firings are just Boolean Logic in biological form. | Artificial Intelligence

Just as Shannon realized that it didn’t matter if a switch was made of copper, vacuum tubes, or wood (as long as it could represent On/Off), Pitts (along with McCulloch) realized it didn’t matter that a brain was “wet” and made of cells. If a neuron only fires when it reaches a certain threshold of inputs, it is performing computation. Shannon proved that Machines could be Logical. Pitts proved that Biology was already Logical.

The Competing Idea: “The Ghost in the Machine”. By saying the brain is just a “Net” of logical gates (The McCulloch-Pitts Neuron), they effectively “killed the ghost.” They suggested that if you build a big enough network of these gates, you don’t just get a calculator; you eventually get consciousness.


1943 — McCulloch & Pitts: The Brain as Logic Machine

The Insight

Warren McCulloch2 and Walter Pitts3 made a connection between logic and neuroscience.4 The key move was noticing that a neuron’s threshold function and a logical proposition have identical structure:

  • If neurons are discrete units that either fire or don’t, they are binary. Binary operations can express Boolean logic, so you can treat a neuron like a logic gate. Once you see that, one neuron = one logical primitive. The threshold→binary pattern is the same everywhere it appears: biological action potentials obey the all-or-none law (a nerve fires at full amplitude or not at all), exactly what M-P abstracted. Shannon’s relay switches are the same gate in copper. The substrate is irrelevant; the threshold logic is identical.
  • A logical sentence says: “this is true if and only if these conditions are met.” A neuron says: “I fire if and only if my inputs exceed my threshold.” These are the same operation, expressed in different vocabularies. A threshold is a truth value.
  • From there, everything follows from what Russell and Whitehead had already proved. Any logical statement can be built from AND, OR, NOT. Any computation can be expressed as a logical statement. Therefore: any computation can be built from neurons. A network of neurons can calculate anything a Turing machine can calculate. The geometry of connections is the software. The brain becomes a “Logic Machine”: a network able to express any finite logical statement, and so any logical thought (Boole had published The Laws of Thought in 1854).

A neuron “realizes” a logical sentence if its firing pattern (On/Off) matches the True/False values of that sentence. “I compute, therefore I think” (McCulloch/Pitts) inverted Descartes’s “I think, therefore I am.”

Real neurons are messy. They don’t just send 1s and 0s; they send chemical pulses at different frequencies, they are awash in hormones, and they “leak.” They fire in graded potentials, with variable thresholds, in continuous time, in a chemical soup. McCulloch and Pitts stripped all of that away.

But three problems stood between the insight and a rigorous proof:

  • Time. Logic is timeless; neurons fire in sequence. They introduced discrete time steps and showed the equivalence holds at each step, sidestepping continuous neural dynamics.
  • Inhibition. AND and OR follow naturally from excitatory inputs summing to a threshold. NOT requires a different mechanism: an input that blocks firing rather than promotes it. Real neurons have inhibitory synapses, and they showed these map cleanly to logical negation.
  • Cycles. Networks with feedback loops can represent memory and state. Do the logical equivalences still hold when a neuron’s output feeds back into its own input? They proved they do, with one caveat: loops that would generate logical paradoxes must be excluded. A cyclic M-P network turned out to be equivalent to a finite automaton — a result Kleene formalised in 1951. McCulloch and Pitts thought cycles were the interesting part, the mechanism for memory and thought. The field spent the next 40 years ignoring them. They came back as Hopfield networks (1982), then LSTMs (1997), then the attention mechanism in transformers: where cycles and memory are, again, everything.5
The McCulloch–Pitts neuron. Excitatory inputs x₁, x₂, x₃ (weights w₁, w₂, w₃) sum up; if the total Σ meets threshold θ and no inhibitory input is active, the neuron outputs 1, otherwise 0.
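A minimal sketch of the unit the figure describes (helper names are mine, not notation from the 1943 paper): sum the weighted excitatory inputs, let any active inhibitory input veto firing, and output 1 iff the sum reaches θ. With unit weights and the right threshold, the same unit realises AND, OR, and NOT.

```python
def mp_neuron(inputs, weights, theta, inhibitory=()):
    """McCulloch-Pitts unit: fires (1) iff the weighted sum of excitatory
    inputs reaches threshold theta and no inhibitory input is active."""
    if any(inhibitory):                  # absolute inhibition: any active inhibitor blocks firing
        return 0
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

# The three logical primitives, with 0/1 inputs:
AND = lambda a, b: mp_neuron([a, b], [1, 1], theta=2)            # fires only if both inputs fire
OR  = lambda a, b: mp_neuron([a, b], [1, 1], theta=1)            # fires if at least one input fires
NOT = lambda a: mp_neuron([1], [1], theta=1, inhibitory=[a])     # constant drive, vetoed by a

assert [AND(a, b) for a, b in [(0,0), (0,1), (1,0), (1,1)]] == [0, 0, 0, 1]
assert [OR(a, b)  for a, b in [(0,0), (0,1), (1,0), (1,1)]] == [0, 1, 1, 1]
assert [NOT(a) for a in (0, 1)] == [1, 0]
```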

What It Is Not

The reticular theory assumed a continuous fog — no discrete units, no describable structure. Gestalt psychology said cognition couldn’t be decomposed into elements at all: the whole is not the sum of its parts. Both denied that the parts had independent structure. M-P’s entire move was to insist they did — and that insisting on discreteness was what made computation possible.

What It Left Open, and Where It Led

The connections were fixed and hand-wired by the designer. There was no learning. The open question: Can the connections be learned from experience?

The two threads M-P opened: (1) the feedforward thread — fixed logical circuits as a proof of representational power, which Rosenblatt would turn into a learning machine; (2) the cycles thread — feedback loops as the mechanism for memory and state, which the field ignored for 40 years before it returned as Hopfield nets (1982), LSTMs (1997), and transformers’ attention (2017).


1949 — Hebb: The First Learning Rule

Donald Hebb’s 1949 book The Organization of Behavior proposed a simple principle: “Neurons that fire together, wire together.”

Formally: $\Delta w_{ij} \propto x_i \cdot x_j$, where $x_i$ and $x_j$ are the activities of the two connected neurons and $w_{ij}$ is the strength of the synapse between them.

Hebbian learning: the synapse from A to B starts weak and strengthens when both fire simultaneously.
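A minimal sketch of the rule (the learning rate and array shapes are illustrative): each weight grows in proportion to the product of pre- and post-synaptic activity, and nothing in the update mentions a target or an error.

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.1):
    """Hebb's rule: delta w_ij proportional to x_i * x_j.
    'pre' and 'post' are the activities of the pre- and post-synaptic units.
    There is no target and no error term -- only co-activation."""
    return w + lr * np.outer(post, pre)

# Weights from 3 input units to 2 output units, all starting at zero.
w = np.zeros((2, 3))
pre  = np.array([1.0, 0.0, 1.0])   # inputs 1 and 3 fire together
post = np.array([1.0, 0.0])        # output unit 1 fires
w = hebbian_update(w, pre, post)
print(w)   # [[0.1 0.  0.1]
           #  [0.  0.  0. ]]  -- only synapses between co-active units strengthened
```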

Hebb’s rule has no error signal. It adjusts based on what fires together, not whether the network was right. It can build associations but cannot correct itself. For correction, you need a signal from the outside world saying you were wrong. That gap is what Rosenblatt would fill nine years later. Easy to confuse: Hebb’s rule and backpropagation both update weights, but they solve completely different problems. Hebb is unsupervised and local: it strengthens correlations with no knowledge of the output. Backprop is supervised and global: it uses a scalar error signal and the chain rule to assign blame backward through every layer. They are not interchangeable.

Because it needs no labels, Hebb’s rule picks up the statistical structure of inputs on its own — the seed of unsupervised learning. Its descendants are autoencoders, PCA, and modern self-supervised models. The supervised thread (Rosenblatt, backprop) and the unsupervised thread (Hebb) run in parallel through the entire history of the field.


1958 — Rosenblatt: The Perceptron

McCulloch and Pitts proved that neurons can compute. Rosenblatt asked: can they learn to compute?

Two competing frameworks shaped what “learning” was even supposed to mean in 1958:

  • Behaviourism held that learning was stimulus-response conditioning: no internal representation needed, just input-output associations shaped by reward and punishment. It said nothing about the internal structure of the learner.
  • Gestalt psychology argued that perception was holistic and you couldn’t decompose it into discrete units. The whole was not the sum of its parts. A unit-based network was, from this view, a category error.

Rosenblatt6 was a psychologist, not a logician. He was thinking about perception, specifically, how the visual cortex learns to recognise things.

The perceptron was not conceived as a computer. It was first built as physical hardware. The Mark I Perceptron (1957): a 20×20 grid of 400 photocells as the “eye”; 512 association units in a middle layer; 8 output units. Connection weights were stored as the physical rotation angles of 512 motor-driven potentiometers — essentially motorised volume knobs. When the machine made a mistake, electric motors physically turned the knobs to adjust the weights. The S→A connections were wired randomly and fixed; only the A→R weights learned. It filled an entire room.

The random S→A wiring was deliberate: Rosenblatt believed the retina connects randomly to the visual cortex. Random projections into a high-dimensional space make many problems linearly separable that weren’t before — an insight that resurfaced decades later in kernel methods and reservoir computing. In his 1962 book Principles of Neurodynamics, Rosenblatt tried going further with four-layer networks, using a time-dependent non-supervised rule to modify lower-layer weights before switching to error-correction. He couldn’t make it work. Hinton cracked the same problem in 2006 with greedy layer-by-layer unsupervised pretraining — the idea that unlocked deep networks. Rosenblatt had identified the right question 44 years earlier.

The gap between Hebb’s rule and the perceptron was the error signal — the same structural problem a judge faces: you can see the final verdict was wrong, but which witness (which hidden weight) do you correct? Hebb’s rule adjusted weights based on co-activation but with no feedback from the world about whether the result was right. Rosenblatt imported the correction idea from control theory and behaviourist reinforcement: if the output is wrong, use the mistake to compute which direction to push the weights.

What makes this work geometrically: the perceptron computes $w \cdot x + b$. The error $(y - \hat{y})$ times the input $x$ gives you exactly the direction in weight space that would have made the output more correct. The mistake is the gradient. You don’t need calculus — the structure of the dot product hands it to you.

The Geometry

Each input is a dimension. Two inputs → each training example is a point in 2D space. The weight vector defines a line — a knife cutting space in two. The bias term $b$ lets the boundary slide freely rather than being forced through the origin. Full computation: $w \cdot x + b \geq 0$.

Learning = rotating that knife until it correctly separates all the data.

The decision boundary rotating until it separates the training data. Blue = class 1, red = class 0.

The Perceptron Learning Rule:

\[w_{i}^{\text{new}} = w_{i}^{\text{old}} + \alpha\,(y - \hat{y})\,x_i\] \[b^{\text{new}} = b^{\text{old}} + \alpha\,(y - \hat{y})\]

When right, $(y - \hat{y}) = 0$ and nothing changes. When wrong, weights shift toward the correct orientation.
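A minimal sketch of the rule in code (the toy dataset and learning rate are illustrative): predict with $w \cdot x + b \geq 0$, and on each mistake push $w$ and $b$ by $\alpha\,(y - \hat{y})$ times the input.

```python
import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=50):
    """Rosenblatt's rule: w += lr * (y - y_hat) * x;  b += lr * (y - y_hat).
    Only mistakes change the weights."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            if y_hat != y_i:
                w += lr * (y_i - y_hat) * x_i
                b += lr * (y_i - y_hat)
                mistakes += 1
        if mistakes == 0:          # converged: every point on the right side of the knife
            break
    return w, b

# A linearly separable toy problem: the OR function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w, b = perceptron_train(X, y)
print([1 if np.dot(w, x) + b >= 0 else 0 for x in X])   # [0, 1, 1, 1]
```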

The Convergence Theorem

Rosenblatt proved: if a linearly separable solution exists, the rule will find it in finite steps. The proof is a neat squeeze argument: two quantities grow at different rates, and their ratio is forced toward 1.

Per mistake, two things happen to the weight vector $W$:

  1. Alignment with the optimal $W^\star$ grows linearly — each correction pushes $W$ toward $W^\star$ by a fixed minimum amount, so $W \cdot W^\star$ increases by at least a constant per mistake.
  2. The magnitude $|W|$ grows as $\sqrt{\text{mistakes}}$ — each update adds at most a bounded amount to $|W|^2$, so $|W|$ grows no faster than the square root of the mistake count.

Their ratio — the cosine $\dfrac{W \cdot W^\star}{|W|\,|W^\star|}$ — has a numerator growing linearly and a denominator growing as $\sqrt{t}$. But a cosine can never exceed 1, so the mistake count must stop growing: the knife is guaranteed to converge. The proof structure is a Lyapunov argument: define a quantity that must monotonically improve and is bounded, then conclude convergence — the standard tool in control theory for proving system stability. The specific geometric step (bounding a cosine by 1 via the inner product) is Cauchy-Schwarz applied dynamically. Rosenblatt absorbed the Lyapunov style from the cybernetics environment around him; the same skeleton appears in online learning theory (Littlestone’s mistake-bound proofs, 1988) and, much earlier, in von Neumann’s minimax theorem (1928).

The squeeze argument made visible. As mistakes accumulate, alignment (blue) grows linearly while magnitude (orange) grows as √t. Their ratio — the cosine — is forced toward 1.
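For completeness, the standard quantitative form of the squeeze, under the usual assumptions (labels written as ±1 so a mistake update is $W \leftarrow W + y\,x$, every input bounded by $|x| \leq R$, and some unit-length $W^\star$ separating the data with margin $\gamma > 0$):

\[W \cdot W^\star \;\geq\; \gamma\,t \qquad\text{and}\qquad |W|^2 \;\leq\; R^2\,t \quad\text{after } t \text{ mistakes,}\]
\[1 \;\geq\; \frac{W \cdot W^\star}{|W|\,|W^\star|} \;\geq\; \frac{\gamma\,t}{R\sqrt{t}} \;=\; \frac{\gamma\sqrt{t}}{R} \quad\Longrightarrow\quad t \;\leq\; \left(\frac{R}{\gamma}\right)^{2}.\]

The mistake count is bounded by $(R/\gamma)^2$: it depends only on how wide the margin is relative to the size of the inputs, not on the number of training examples or the dimension of the space.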

What It Left Open, and Where It Led

The linearity constraint: the knife can only be a straight line. If no straight line separates the data, the algorithm doesn’t just fail to converge — it cycles forever. Common confusion: the convergence theorem guarantees finding a separator, not the best one. Any hyperplane that correctly divides the data counts. This is why the perceptron doesn’t generalise well even when it converges — it finds the first valid boundary, not the widest one. That gap is what the Support Vector Machine (1990s) fixed, using the same geometric intuition but optimising for maximum margin.

The perceptron’s learning rule was the first proof that weights could be found automatically from data. That idea of error-driven correction seeded everything in supervised learning. But it only worked for one layer. Could it work for more layers? The question it left open was exactly the one Minsky and Papert would weaponise.


1969 — Minsky & Papert: The Crisis

The Ideas in Conflict

By 1969, two positions were hardening:

  • The connectionist view: neural networks, learning from examples, distributed representations. Rosenblatt’s camp.
  • The symbolic AI view: explicit rules, logic, hand-crafted knowledge representations. GOFAI. Minsky’s camp.

Both agreed that multi-layer networks could represent anything. McCulloch and Pitts had proved that in 1943. The dispute was whether that representational power was reachable by any learning rule.

Minsky and Papert7 argued that credit assignment, apportioning blame to hidden weights from an output error, was intractable.8

The Theorem: A single-layer perceptron can only solve linearly separable problems.

The diagnostic case is XOR: output 1 when inputs differ, 0 when they match.

The four XOR cases:

x₁ | x₂ | output
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
The two 1s sit at diagonally opposite corners. No straight line can separate them from the two 0s — not hard, geometrically impossible.

The crucial distinction, easy to miss:

  • M-P proved: you can build XOR with a hand-wired multi-layer circuit.
  • Minsky & Papert proved: a single-layer network cannot learn the weights for XOR.

These are different claims. The first is about representational capacity. The second is about learnability. Minsky and Papert were right about single-layer networks.

The implicit claim that multi-layer networks couldn’t be trained either was never proved. It was an open problem, stated as a verdict.

XOR is not showing that neural networks can’t learn nonlinear patterns. It’s showing that a single-layer network can’t even represent XOR — no weights exist that would make it correct, because no line can separate the points. Add one hidden layer and the problem vanishes: the hidden layer bends the input space until the classes are linearly separable.
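A minimal sketch of that fix with hand-wired weights (this particular decomposition, OR-and-not-AND, is just one choice): two hidden threshold units compute OR and AND of the inputs, and the output unit fires when OR is on and AND is off, exactly the region no single line could carve out.

```python
def step(z):
    """Threshold unit: fires (1) iff its net input is non-negative."""
    return 1 if z >= 0 else 0

def xor_two_layer(x1, x2):
    """XOR with one hidden layer of threshold units, weights wired by hand.
    h_or fires if either input is on; h_and fires only if both are on.
    The output fires when h_or is on and h_and is off."""
    h_or  = step(x1 + x2 - 1)           # OR:  threshold 1
    h_and = step(x1 + x2 - 2)           # AND: threshold 2
    return step(h_or - 2 * h_and - 1)   # h_or AND NOT h_and

assert [xor_two_layer(a, b) for a, b in [(0,0), (0,1), (1,0), (1,1)]] == [0, 1, 1, 0]
```

The hidden units re-describe each input as the pair (OR, AND); in that new space the four points are linearly separable, which is what “the hidden layer bends the input space” means concretely.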

The Irony

The credit assignment problem asks: how do you propagate an error signal backward through multiple layers to assign blame to hidden weights?

The answer was already implicit in the architecture. Multi-layer networks are compositions of functions. The chain rule from calculus decomposes the gradient of any composition layer by layer. This is exactly what was needed. The same compositional structure that made the networks hard to train made the gradient decomposable. The problem contained its own solution.
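Concretely, for a two-layer composition (the notation is mine, chosen only to show the shape of the decomposition): if the network computes $\hat{y} = f_2(f_1(x; w_1); w_2)$ with loss $L(\hat{y}, y)$, the chain rule splits the gradient layer by layer:

\[\frac{\partial L}{\partial w_2} \;=\; \frac{\partial L}{\partial \hat{y}}\,\frac{\partial f_2}{\partial w_2}, \qquad \frac{\partial L}{\partial w_1} \;=\; \frac{\partial L}{\partial \hat{y}}\,\frac{\partial f_2}{\partial f_1}\,\frac{\partial f_1}{\partial w_1}.\]

The output-layer factor $\partial L / \partial \hat{y}$ is computed once and passed backward; each layer multiplies in its own local derivative. That layer-by-layer factoring is the entire structure of backpropagation.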

The ideas for backpropagation existed earlier9 but didn’t land until the conditions were right.

Where the Threads Went

Three threads survived the winter independently and reconverged in the 1980s–90s:

  • The supervised learning thread (Rosenblatt → Werbos → Rumelhart/Hinton, backprop 1986);
  • The unsupervised thread (Hebb → Hopfield’s associative memory 1982, which showed that recurrent networks with Hebbian-style weights could store and retrieve patterns);
  • The cycles thread (M-P → Kleene → Hopfield → LSTMs → transformers).

None of them died. They went underground and came back as the ingredients of deep learning.

How it was solved is the story of the next era — continued in Neural Networks: The Deep Learning Revolution.


  1. Cajal and Golgi shared the 1906 Nobel Prize — and Golgi used his acceptance speech to publicly deny Cajal’s findings. The irony was complete: Cajal had used Golgi’s own silver staining technique to disprove Golgi’s theory. Golgi had the better tools; Cajal had the better eyes. They stood at the same podium and contradicted each other. Golgi never conceded. The neuron doctrine won not because its opponents were persuaded, but because they died. 

  2. McCulloch was a 40-year-old established neuropsychiatrist when he encountered Pitts, a homeless teenager, and brought him into his household. Their collaboration lasted years and produced foundational work. It ended in personal catastrophe: Norbert Wiener’s wife spread a rumor that Pitts had made advances toward Wiener’s daughters. Wiener cut off contact with McCulloch’s entire group without explanation. Pitts, who had treated Wiener as a father figure, was devastated. He burned his unpublished manuscripts, retreated into alcoholism, and died at 46. McCulloch never learned the true cause of the rupture until after Pitts was dead. 

  3. Pitts was a runaway teenager living on the streets of Chicago when McCulloch found him. He had taught himself Greek, Latin, and Principia Mathematica by age 12. He never finished high school, never earned a degree, and wrote one of the most influential papers in the history of neuroscience at age 18. 

  4. The paper didn’t emerge from isolation. McCulloch and Pitts were embedded in the Macy Conferences (1946–1953), an extraordinary interdisciplinary salon that brought together Norbert Wiener, John von Neumann, Claude Shannon, Gregory Bateson, and Margaret Mead. Neuroscience, computing, information theory, and social science were being invented simultaneously in the same rooms. The concept of feedback and control — cybernetics — was the shared air. The McCulloch-Pitts paper was the output of an entire intellectual ecosystem, not two people working alone. 

  5. The cycles thread split three ways. (1) Kleene (1951) proved that M-P networks with cycles characterise exactly the regular languages — the same class recognised by finite automata. This became a foundational result in formal language theory, still in every automata textbook, usually without crediting its neural net origin. (2) The result fed into broader conversations about the Church-Turing thesis — what machines can in principle compute. (3) The thread was then largely dropped by the neural network field, which pivoted to learning via feedforward networks where error signals are easier to define. Cyclic networks came back as Hopfield networks (1982), then LSTMs (1997), and finally the attention mechanism in transformers — where memory and temporal reasoning are, again, everything. 

  6. Rosenblatt was a showman. He held press conferences. The New York Times in 1958 reported the Navy had built a machine that “will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” Rosenblatt didn’t exactly say all of this, but he didn’t discourage it either. The hype created the target. When Minsky and Papert published Perceptrons in 1969, the takedown landed harder because expectations had been so inflated. Rosenblatt died in a boating accident in 1971 — the year before the AI Winter fully set in. He never saw the vindication. 

  7. The politics behind Perceptrons (1969) are still debated. Minsky was at MIT; Rosenblatt at Cornell. They were rivals. The book’s proof was technically correct — single-layer perceptrons cannot learn XOR — but it was widely read as condemning neural networks entirely, which it did not actually claim. The field largely believed this reading. The effect on funding was real and immediate. DARPA pulled support. An entire research programme collapsed. Whether this was a dispassionate scientific critique or a territorial hit on a competing paradigm, the book shaped what got funded for the next decade, which shaped what got discovered. 

  8. The AI Winter that followed was institutional, not intellectual. DARPA cut neural network funding almost entirely after Perceptrons. Universities stopped hiring in the area. Graduate students were warned away from the topic — it was career poison. The handful of researchers who kept working (Hinton in Toronto, LeCun in Bell Labs, Schmidhuber in Switzerland) did so at the margins, on shoestring budgets, against professional advice. The ideas didn’t die; they went underground. What the funding structure decided was unproductive turned out to be the most productive direction in the history of machine learning. 

  9. Paul Werbos derived backpropagation in his 1974 Harvard PhD thesis. He got essentially no response — neuroscientists didn’t see the relevance, computer scientists weren’t reading theses in that corner of the library. David Parker rediscovered it independently in 1982. Yann LeCun derived it again in 1985. None of these earlier derivations had institutional weight behind them. Rumelhart, Hinton, and Williams published in Nature in 1986 — the same mathematics, but backed by the right names, at the right moment, in the right journal. The idea that gets remembered is the one that lands in fertile ground. The same discovery, made in the wrong soil, disappears.