The feedforward network sees the world as a flat snapshot. The recurrent network sees it as a scroll unrolling through time — and has to hold, in a fixed-size vector, everything that matters about what it has read so far.
The Gap CNNs Left Open
CNNs exploited a specific kind of structure: spatial locality and translation invariance. When a pattern could appear anywhere in an image, the convolutional filter found it wherever it was. But the architecture assumed the input had a fixed grid structure — every pixel in its place.
Language has none of these properties. “The animal didn’t cross the street because it was too tired” — to resolve what “it” refers to, you need to remember “animal” from five words ago. Protein sequences have dependencies spanning hundreds of residues. A speech signal has phonemes depending on what came before. The relevant context is not spatially local; it is temporally distant and variable in distance.
Three things made a different architecture necessary:
- Variable-length inputs: a sentence can be 3 words or 300; a CNN with fixed filter sizes cannot handle this without padding tricks.
- Sequential dependencies: the meaning of word $t$ depends on words $1, \ldots, t-1$. An FC network applied to the full sequence treats every position independently; order becomes invisible.
- Shared temporal patterns: the word “the” before a noun means the same thing at position 3 or position 300. You want weights that are shared across time, the way convolutions share weights across space.
The architecture that addresses all three is the Recurrent Neural Network — a network with a loop.
The Recurrent Neural Network
The central idea is simple: at each time step, the network receives an input and also its own hidden state from the previous step. This hidden state is a summary of everything the network has seen so far.
\(h_t = f(W_h h_{t-1} + W_x x_t + b)\) \(y_t = g(W_y h_t)\)
The same weights $W_h, W_x, W_y$ apply at every time step. This is weight sharing through time, exactly analogous to convolutional weight sharing through space.

The conceptual predecessors are Elman (1990) and Jordan (1986) networks. Jordan fed the network’s own output back as input; Elman fed the hidden state back. Both had the recurrence idea. The key shift was realising that the hidden state, not just the output, carries the relevant history. Jordan networks are nearly forgotten; Elman networks became the standard models studied in cognitive science throughout the 1990s as models of syntax acquisition and language comprehension.
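To make the recurrence and the weight sharing concrete, here is a minimal NumPy sketch of the forward pass; the dimensions, the tanh non-linearity, and the identity output function are illustrative assumptions, not anything fixed by the equations above.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, W_y, b):
    """Run a vanilla RNN over a sequence; the same W_h, W_x, W_y are reused at every step."""
    h = np.zeros(W_h.shape[0])                   # initial hidden state h_0
    ys = []
    for x_t in xs:                               # one step per input, in order
        h = np.tanh(W_h @ h + W_x @ x_t + b)     # h_t = f(W_h h_{t-1} + W_x x_t + b)
        ys.append(W_y @ h)                       # y_t = g(W_y h_t), with g = identity here
    return np.array(ys), h

# Toy dimensions: 4-dim inputs, 8-dim hidden state, 3-dim outputs.
rng = np.random.default_rng(0)
W_h, W_x, W_y = rng.normal(size=(8, 8)), rng.normal(size=(8, 4)), rng.normal(size=(3, 8))
b = np.zeros(8)
seq = rng.normal(size=(10, 4))                   # a length-10 input sequence
ys, h_final = rnn_forward(seq, W_h, W_x, W_y, b)
```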
What It Is Not
An RNN is not a lookup table for sequences. A Markov model says the next token depends only on the last $n$ tokens — the context window is fixed and explicit. An RNN compresses all prior context into $h_t$, with no hard cutoff. The question is whether the compression is any good — and for long sequences, the answer often turns out to be no.
An RNN is also not a sequence-to-sequence model by default. A single-output RNN predicts one thing at the end of a sequence. A many-to-many RNN produces one output per input token. The encoder-decoder architecture — two RNNs, one that reads, one that generates — is a separate design that comes later.
Backpropagation Through Time
Training an RNN means computing gradients through the unrolled graph. The same weight matrix $W_h$ appears at every time step, and the chain rule multiplies those appearances together. The sensitivity of the final hidden state to the hidden state at step $t$ is the product:
\[\frac{\partial h_T}{\partial h_t} = \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k} = \prod_{k=t}^{T-1} W_h^T \cdot \text{diag}(f'(h_k))\]

If $W_h$ has singular values less than 1, this product vanishes exponentially in $(T - t)$. If they’re greater than 1, it explodes. Gradient clipping (clip $|\nabla| > \theta$ to $\theta$) handles explosions cheaply. Vanishing gradients are the harder problem — the network simply cannot learn dependencies longer than its effective gradient horizon.

Hochreiter proved this rigorously in his 1991 diploma thesis — the same thesis that diagnosed vanishing gradients in deep feedforward networks. The recurrent version is worse: the same weight matrix is multiplied once per time step, so a 100-step sequence has $W_h^{100}$ in the gradient. Unless all singular values of $W_h$ are exactly 1, the gradient either disappears or explodes. The thesis was largely ignored at the time. Hochreiter went on to write the LSTM paper with Schmidhuber in 1997.
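A quick numerical sketch of the argument, under a simplifying assumption: $W_h$ is taken to be a scaled orthogonal matrix so that all its singular values are equal, and the non-linearity's derivative is left out. Scaling slightly below or above 1 makes the 100-step product vanish or explode; the clipping rule at the end is the cheap fix for the exploding case.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))    # orthogonal matrix: all singular values exactly 1

for scale in (0.9, 1.0, 1.1):
    W_h = scale * Q
    prod = np.eye(64)
    for _ in range(100):                          # the W_h^100 factor from a 100-step sequence
        prod = W_h @ prod
    print(scale, np.linalg.norm(prod, 2))         # 0.9 -> ~3e-5, 1.0 -> 1.0, 1.1 -> ~1.4e4

def clip_gradient(g, theta=5.0):
    """Gradient clipping: rescale g so its norm never exceeds the threshold theta."""
    norm = np.linalg.norm(g)
    return g * (theta / norm) if norm > theta else g
```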
1997 — The LSTM: A Cell with a Memory
Sepp Hochreiter and Jürgen Schmidhuber’s solution appeared in Neural Computation in 1997.[1] Their diagnosis: the problem is the recurrence itself. At every time step, the hidden state is overwritten by a new non-linear transformation. There is no path through which a gradient can flow without being multiplied by something less than 1.
The fix: add a separate memory cell that has an identity self-connection — a wire that carries information forward untransformed. The cell state $C_t$ flows through time like a conveyor belt: information can be added or removed, but the default is that it passes through unchanged.
\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\) \(h_t = o_t \odot \tanh(C_t)\)
Three gates control what happens to the cell state:
| Gate | What it does | When it fires |
|---|---|---|
| Forget $f_t$ | Erase parts of the cell state | “Forget the previous subject when a new one starts” |
| Input $i_t$ | Write new content to cell state | “Remember this new number/name/entity” |
| Output $o_t$ | Read from cell state into hidden state | “Output the current subject for pronoun resolution” |
Each gate is a sigmoid over a learned linear function of $h_{t-1}$ and $x_t$. Each produces a vector of values in $[0, 1]$, applied element-wise. Gates are soft: they don’t switch sharply but continuously interpolate.
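As a sketch of how the gate equations fit together in code: the dimensions, the initialisation, and the packing of all four pre-activations into a single weight matrix are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM time step; W maps the concatenated [h_{t-1}, x_t] to all four pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t, i_t, o_t, g_t = np.split(z, 4)
    f_t, i_t, o_t = sigmoid(f_t), sigmoid(i_t), sigmoid(o_t)   # gates, element-wise in [0, 1]
    C_tilde = np.tanh(g_t)                                     # candidate content
    C_t = f_t * C_prev + i_t * C_tilde                         # C_t = f_t * C_{t-1} + i_t * C~_t
    h_t = o_t * np.tanh(C_t)                                   # h_t = o_t * tanh(C_t)
    return h_t, C_t

# Toy sizes: 4-dim input, 8-dim hidden/cell state.
rng = np.random.default_rng(0)
hidden, inp = 8, 4
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)
h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(10, inp)):                         # run over a length-10 sequence
    h, C = lstm_step(x_t, h, C, W, b)
```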
The Constant Error Carousel
Hochreiter’s name for the key mechanism: a constant error carousel. The gradient of the cell state with respect to the previous cell state, along the forget-gate path, is exactly $f_t$ — which can be 1 if the gate is open. When $f_t \approx 1$, the gradient flows back through the cell state unchanged: $\partial C_t / \partial C_{t-1} = f_t \approx 1$. No exponential decay.
The forget gate is the key to long-range memory. If the network learns to keep $f_t \approx 1$ for a particular dimension of the cell state, that dimension carries its value indefinitely — hundreds of time steps — with a gradient that doesn’t vanish.

This is structurally the same insight as the ResNet skip connection: provide a path through which the gradient can flow without being multiplied by a learned weight. In ResNets the skip connection is fixed at +1. In LSTMs the “skip” is learnable via the forget gate — which is more powerful but also more complex to train. Both solve the gradient highway problem by different means.
What It Is Not
The cell state is not a RAM register. It doesn’t store symbols that can be looked up by address. It stores a continuous-valued vector, mixed and blended at every step by soft gates. What the LSTM learns to remember is whatever pattern of activations is most useful for predicting the next output — often interpretable as “the current subject,” “the open parenthesis count,” or “the tense of the clause,” but these are emergent, not explicitly designed.
The misconception: LSTM completely solves vanishing gradients for all sequence lengths. It greatly extends the effective memory, but it still degrades over very long sequences. The forget gate can still close ($f_t \approx 0$), erasing memory. And the hidden state output at each step still goes through $\tanh$ and the output gate. For sequences of thousands of steps, LSTMs still struggle — which is part of what motivated attention and eventually transformers.
GRU: The Simplified Alternative
Kyunghyun Cho’s 2014 Gated Recurrent Unit merges the cell state and hidden state into one, with two gates instead of three:
- Update gate $z_t$: combines the forget and input gates — $z_t$ controls how much of the old hidden state to keep and how much to replace.
- Reset gate $r_t$: controls how much of the previous hidden state to use when computing the new candidate.
Fewer parameters, faster to train, comparable performance on most benchmarks. LSTMs are more expressive in principle; GRUs are more efficient in practice. The community split on which to use and largely never resolved the question — transformers made it moot.
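For reference, the GRU update written out in the same style as the LSTM equations above, with biases omitted, $[\cdot, \cdot]$ denoting concatenation, and the convention that $z_t = 1$ keeps the old state:

\[z_t = \sigma(W_z [h_{t-1}, x_t]) \qquad r_t = \sigma(W_r [h_{t-1}, x_t])\]

\[\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]) \qquad h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t\]

When $z_t \approx 1$ the old state passes through untouched, playing the role the LSTM's forget gate plays; when $r_t \approx 0$ the candidate is computed from the input alone.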
Karpathy’s Experiment: What RNNs Actually Learn
In 2015, Andrej Karpathy trained a character-level LSTM on large text corpora and let it generate text one character at a time.

Karpathy was a PhD student under Fei-Fei Li at Stanford — the same Fei-Fei Li who built ImageNet. The char-rnn experiments were a side project, written in Lua/Torch over a few evenings. The blog post about them received more attention than most peer-reviewed papers from that year. Karpathy later said the point was not the generated text itself but what the internal representations revealed about what RNNs had learned to track.
The results were striking — not because the output was good, but because of what inspecting the hidden states revealed:
- A neuron tracking open/close quote depth — its activation went up when text was inside quotation marks and down when it exited.
- A neuron tracking line position — its activation tracked whether the model was near the start or end of a line of code.
- A neuron tracking if-block nesting depth — activation correlated with the indentation level in C source code.
None of these properties were specified in the objective. The model was trained only to predict the next character. It learned these tracking properties because they were predictive — you can’t close a quote without having opened one; you can’t end an if-block without knowing you’re inside one.
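The generation loop itself is simple. Here is a toy sketch of the mechanics: random weights, a vanilla tanh recurrence standing in for the LSTM Karpathy used, and a made-up three-character vocabulary, purely to show how each sampled character is fed back in as the next input.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = list("abc")                                # toy vocabulary; char-rnn used the full character set
V, H = len(vocab), 16
W_h = rng.normal(scale=0.1, size=(H, H))
W_x = rng.normal(scale=0.1, size=(H, V))
W_y = rng.normal(scale=0.1, size=(V, H))

def sample_chars(n, temperature=1.0):
    """Generate n characters, one at a time, feeding each sample back in as the next input."""
    h, x, out = np.zeros(H), np.zeros(V), []
    for _ in range(n):
        h = np.tanh(W_h @ h + W_x @ x)             # recurrence
        p = np.exp((W_y @ h) / temperature)
        p /= p.sum()                               # softmax: distribution over the next character
        idx = rng.choice(V, p=p)                   # sample (argmax would make it deterministic)
        out.append(vocab[idx])
        x = np.zeros(V)
        x[idx] = 1.0                               # one-hot of the sampled character is the next input
    return "".join(out)

print(sample_chars(20))
```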
2014 — Seq2Seq: Encoding to Decoding
Machine translation requires a different architecture: the input sequence (one language) must be entirely read before the output sequence (another language) begins. Word order differs between languages; you cannot translate word-by-word.
Sutskever, Vinyals, and Le’s 2014 paper introduced the encoder-decoder (seq2seq) architecture:
- Encoder: an LSTM reads the entire source sentence, producing a final hidden state — the context vector $c$ — that encodes everything the decoder will need.
- Decoder: a second LSTM, initialised with $c$, generates the target sentence token by token.
\(c = \text{Encoder}(x_1, \ldots, x_T)\) \(y_t = \text{Decoder}(c, y_1, \ldots, y_{t-1})\)
Ilya Sutskever — the same Sutskever who built AlexNet with Krizhevsky and Hinton — was by then at Google Brain. The seq2seq paper appeared in NeurIPS 2014 and had an immediate practical impact: it was deployed in Google Translate, replacing the phrase-based statistical MT system that had been the state of the art for a decade.
The analogy: the encoder compresses the source sentence into a single thought; the decoder uncompresses it into the target language.
The irony: the better the compression, the more is lost. A single fixed-size context vector must carry all the information from a sentence of arbitrary length. For short sentences this works. For long ones, the encoder forgets early words by the time it finishes reading. The bottleneck is not the architecture — it is the context vector itself.
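The shape of the computation, as a hedged sketch: vanilla tanh recurrences stand in for the paper's LSTMs, the weights are random, and greedy argmax decoding replaces the beam search Sutskever et al. actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 16, 5                                      # hidden size, toy target vocabulary size
enc_W = rng.normal(scale=0.1, size=(H, H + 3))    # encoder step: [h, x_t] -> h, for 3-dim source vectors
dec_W = rng.normal(scale=0.1, size=(H, H + V))    # decoder step: [h, previous token one-hot] -> h
out_W = rng.normal(scale=0.1, size=(V, H))        # hidden state -> target-vocabulary logits
EOS = 0                                           # index of the end-of-sequence token

def encode(source):
    h = np.zeros(H)
    for x_t in source:                            # read the entire source sentence...
        h = np.tanh(enc_W @ np.concatenate([h, x_t]))
    return h                                      # ...and return a single context vector c

def decode(c, max_len=20):
    h, prev, out = c, np.zeros(V), []             # decoder initialised with the context vector
    for _ in range(max_len):
        h = np.tanh(dec_W @ np.concatenate([h, prev]))
        tok = int(np.argmax(out_W @ h))           # greedy choice of the next target token
        if tok == EOS:
            break
        out.append(tok)
        prev = np.zeros(V)
        prev[tok] = 1.0                           # feed the chosen token back in
    return out

print(decode(encode(rng.normal(size=(7, 3)))))    # a 7-word "source sentence" of 3-dim vectors
```

Everything the decoder knows about the source has to pass through the single vector returned by `encode`; that is the bottleneck the next section removes.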
2015 — Attention: Breaking the Bottleneck
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio’s 2015 paper made a single change that transformed the field: instead of compressing the source into one vector, let the decoder look back at all encoder hidden states, with a different weight for each.
At each decoder step $t$, compute an attention weight $\alpha_{t,s}$ for each source position $s$:
\[\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{s'} \exp(e_{t,s'})} \quad \text{where } e_{t,s} = a(h^{\text{dec}}_t, h^{\text{enc}}_s)\]

The context vector at each step is now a weighted sum of all encoder states:

\[c_t = \sum_s \alpha_{t,s} h^{\text{enc}}_s\]

The alignment function $a$ is itself a small neural network, learned jointly with the encoder and decoder. Attention is learned, not designed.
The result: when translating “the cat sat on the mat” and generating the word corresponding to “chat” in French, the decoder can look directly at the encoder’s representation of “cat” — wherever it appears in the source sequence — regardless of sequence length.
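A sketch of that computation, following the two equations above; the alignment network's size and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
H, A, T_src = 16, 8, 7                                    # hidden size, alignment size, source length
W_a = rng.normal(scale=0.1, size=(A, 2 * H))              # the small alignment network a(., .)
v_a = rng.normal(scale=0.1, size=A)

def attention(h_dec, h_enc):
    """Return the weights alpha_{t,s} and the context vector c_t for one decoder step."""
    scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([h_dec, h_s])) for h_s in h_enc])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                  # softmax over source positions
    c_t = alpha @ h_enc                                   # weighted sum of all encoder states
    return alpha, c_t

h_enc = rng.normal(size=(T_src, H))                       # one encoder hidden state per source word
h_dec = rng.normal(size=H)                                # current decoder state (the query)
alpha, c_t = attention(h_dec, h_enc)
print(alpha.round(3), alpha.sum())                        # weights over source positions, summing to 1
```

The decoder recomputes these weights at every output step, so different target words can focus on different source positions.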
What Attention Actually Is
Attention is a soft, differentiable lookup. Given a query (the decoder state), keys (the encoder states), and values (also the encoder states), it:
- Scores each key against the query (how relevant is this source position?)
- Normalises scores to weights via softmax
- Returns a weighted sum of values
This query-key-value framing turns out to be the general mechanism behind transformers. In transformers, every token can attend to every other token — not just decoders attending to encoders. The “attention” in “self-attention” is the same computation, applied within a single sequence.
The irony: attention was introduced as a fix for a specific RNN failure mode. Within two years, Vaswani et al. would show that if you had attention, you didn’t need the RNN at all.
Real-World Applications: What the Era Built
The RNN era (roughly 2013–2017) produced systems that worked in production:
Speech recognition: deep bidirectional LSTMs (reading forward and backward simultaneously) replaced hand-crafted acoustic models. The 2014 DeepSpeech paper from Baidu showed that a sufficiently large recurrent network, trained end-to-end on raw audio spectrograms with enough data, matched the best engineered speech pipelines. The same principle: raw input, end-to-end learning, scale.
Machine translation: seq2seq + attention replaced the phrase-based statistical MT systems (Moses, etc.) that had been state of the art since the early 2000s. Google deployed GNMT (Google Neural Machine Translation) in 2016, replacing their existing system overnight for eight languages. The quality jump was the largest single improvement in Google Translate’s history.
Language modelling: character-level LSTMs generated text that, for the first time, was locally coherent — grammatical sentences, plausible paragraph structure, even approximately correct code. Not because the model understood meaning, but because language has enough local statistical structure that a good hidden-state compression captures it.

The distinction between “locally coherent” and “globally coherent” is sharp. A character-level LSTM generates text that looks good for a few sentences, then drifts — it has no mechanism to maintain a consistent argument, narrative, or code logic over more than ~100 characters. It knows that after “the cat sat on the” comes something consistent with English noun phrases, but it doesn’t know that the story is about a specific cat that was introduced three paragraphs ago. Long-range consistency requires long-range gradient flow, which the LSTM only partially solves.
Handwriting generation: Alex Graves’s 2013 handwriting generation paper showed that a mixture-density LSTM could generate smooth, realistic handwriting — predicting the next pen position as a probability distribution over 2D coordinates. The model had to predict not just the stroke, but when to lift the pen. The generated samples were indistinguishable from human handwriting.
What This Era Left Open
The RNN era established that sequential structure could be learned end-to-end. Three gaps remained:
The parallelism problem: an RNN is fundamentally sequential. To compute $h_t$, you need $h_{t-1}$. You cannot process time steps in parallel. This is not a software limitation — it is intrinsic to the recurrence. For long sequences, training is slow. For very long sequences, it is prohibitive. GPUs are built for parallelism; RNNs cannot use it. This is the decisive computational argument for transformers.
The long-range dependency problem: LSTMs extend the effective memory but do not solve it. The cell state can carry information for tens or hundreds of steps; across thousands of steps, it still degrades. Attention with a fixed number of decoder steps looks at the full encoder, but the encoder itself still runs sequentially and can forget. Full pairwise attention — every position attending to every other — solves this, but requires $O(n^2)$ computation.
The inductive bias question: RNNs assume sequential, causal structure — position $t$ depends only on positions $1, \ldots, t-1$. This is right for language generation (you can’t use future words to predict the current one) but wrong for many other tasks where bidirectional context helps. The transformer’s self-attention is non-causal by default — every position sees every other — and causality is imposed by masking when needed.
The answer to all three arrived in 2017. Vaswani et al.’s “Attention Is All You Need” removed the recurrence, kept the attention, and showed that the parallel transformer could train faster, attend over longer sequences, and match or exceed RNN performance on translation. The RNN’s role as the sequence model was over in roughly three years.
Continued in Neural Networks: Attention Is All You Need
---

[1] Hochreiter and Schmidhuber’s 1997 LSTM paper received roughly 15 citations per year for its first five years. Schmidhuber spent the next two decades documenting, carefully and at length, that the LSTM was his lab’s work and that the field had not properly credited it. He was not wrong about the credit; he was unusual in the vigour with which he pursued it. The paper now has approximately 98,000 citations — one of the most-cited papers in machine learning history. Hochreiter received the Milner Award (British Royal Society) in 2024, the same year Hinton received the Nobel Prize.