Log likelihood jumps out everywhere in ML, occupying a place of prominence similar to what $\pi$ has in geometry. Is it just a mathematical trick to make computations with probabilities tractable? It turns out there is a deeper reason: it is a natural consequence of the principle that underpins all modern ML:
Learning from experience is about reducing surprise about what works.
Logic is absolute. The real world is messy. When faced with uncertainty, instead of abandoning logic, we extend it through probability. Probability is the study of correct reasoning under uncertainty.
Where logic deals in truth values, 0 or 1, probability deals in degrees of plausibility in $[0,1]$. Instead of propositions, we study events and evidence.
So:
P(A) = 1 => I know A is true.
P(A) = 0 => I know A is false.
Intermediate values => “I don’t know; A is more or less plausible.”
Probabilities are thus generalized truth values.1
In logic, inference is deductive: If A implies B, and A is true, then B must be true.
In probability, Bayesian inference is inductive: If A supports B to some degree, and A is plausible, then B’s plausibility updates by Bayes’ rule.
To see an example of Bayesian inference: Imagine a probabilistic model $p_\theta(x)$ encodes your beliefs about how likely different observations $x$ are, given parameters $\theta$. When you observe a new data point $x$, you want to update those beliefs. The rule for inference here is called Bayes’ rule:
\[p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)\]There are certain dynamics that play out when we model the world as events with uncertainty and reason about it:
For independent events, likelihoods are multiplicative. If one event is risky, and the other is risky, the risk of both occurring is multiplicative. When evidence accumulates, it does so through conjunction: data point 1 and data point 2 and data point 3. The likelihood of all those observations is the product of their individual likelihoods (under independence).
Consider several examples of multiplicative probability in independent events:
Flipping 10 coins:
The probability of obtaining one specific sequence of heads and tails is
\(\left(\frac{1}{2}\right)^{10}\) since each coin flip is independent with probability $\frac{1}{2}$ for each outcome.
Observing 100 i.i.d. data points:
The likelihood of observing a particular sequence is the product of the individual probabilities:
\(p(x_1, x_2, \dots, x_{100}) = \prod_{i=1}^{100} p(x_i)\)
provided the $x_i$ are independent and identically distributed.
When you want to combine many uncertainties (as in predicting outcomes across multiple steps), multiplying tiny probabilities quickly becomes intractable. Humans and our tools reason better in linear terms: sums, averages, margins. Converting probabilities to their logarithms turns products into sums: this is numerically stable and easier to differentiate.
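A minimal Python sketch (the probabilities are made up for illustration) of why working in log space matters in practice:

```python
import math

# A hypothetical sequence of 1000 i.i.d. observations, each assigned probability
# 0.01 by our model (numbers chosen only for illustration).
probs = [0.01] * 1000

# Multiplying directly underflows: the true value (1e-2000) is far below
# the smallest representable float.
product = 1.0
for p in probs:
    product *= p
print(product)          # 0.0

# Summing logs is stable and keeps the same information in log space.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)   # -4605.17... = 1000 * log(0.01)
```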
We could also think in terms of possible worlds. Whenever we define $n$ independent events, features, or observations, each independent degree of freedom multiplies the number of possible worlds. Each new fact or observed outcome slices away possible worlds. Learning both A and B means we’re narrowing further, i.e. moving into a smaller intersection.
Once an uncertain event in the world is resolved, we get some information.
How do we measure information? And how should the information from an observation behave when we reason about uncertainty?
Common events (like the sun rising) barely update your beliefs; rare events (like a desert flood) force you to rewrite your model entirely. Thus, the rarity of an event must set its information content.
Before Claude Shannon, people used to think of information in terms of semantics, i.e. a message’s meaning. But Shannon’s insight was to anchor information to how surprising an event is i.e. “how unlikely is the world we observe, given our expectations?”
Shannon reasoned that any measure of information must have two properties: rarer events must carry more information, and the information from independent events must add up.
To see the second property, suppose you want to know the outcomes of two independent experiments:
To identify both outcomes, you’d need the same number of yes/no questions as asking about each separately and adding up the answers.
It can be proven that the only function satisfying both properties is the logarithm:
\[I(x) = -\log P(x)\]$-\log P(x)$ is the unique measure of information, up to the choice of logarithm base, which only sets the unit (bits for base 2, nats for base $e$).23
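As a quick numeric check of both properties (the same numbers appear in the footnote):

```python
import math

def surprise_bits(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

print(surprise_bits(0.5))          # coin flip: 1.0 bit
print(surprise_bits(1 / 6))        # die roll: ~2.585 bits
# For independent events, the surprise of the joint outcome is the sum:
print(surprise_bits(0.5 * 1 / 6))  # coin AND die: ~3.585 bits = 1.0 + 2.585
```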
Can we learn from information?
Say a probabilistic model $p_\theta(x)$ encodes your beliefs about how likely different observations $x$ are, given parameters $\theta$.
Each observation has a surprise:
\[\text{surprise}(x) = -\log p_\theta(x)\]The gradient of this surprise with respect to $\theta$:
\[\nabla_\theta \left( -\log p_\theta(x) \right ) = -\nabla_\theta \log p_\theta(x)\]tells you how to adjust your model to make the observed event less surprising next time.
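A toy sketch, assuming a one-parameter Bernoulli model: a single gradient step on the surprise moves $\theta$ toward the observation and makes the same event less surprising.

```python
import math

# Assumed toy model: Bernoulli with parameter theta,
#   p_theta(x=1) = theta,  p_theta(x=0) = 1 - theta.
def surprise(theta, x):
    return -math.log(theta if x == 1 else 1 - theta)

def surprise_grad(theta, x):
    # d/dtheta of -log p_theta(x), i.e. the negative of the score function
    return -1.0 / theta if x == 1 else 1.0 / (1.0 - theta)

theta, lr = 0.2, 0.01
x = 1                                   # the model thought x=1 was unlikely, yet we observed it
print(surprise(theta, x))               # ~1.609 nats of surprise

theta -= lr * surprise_grad(theta, x)   # step against the gradient of the surprise
print(theta)                            # 0.25: belief moved toward the observation
print(surprise(theta, x))               # ~1.386 nats: the same event is now less surprising
```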
Imagine every possible probability distribution $p_\theta(x)$ as a point on a smooth surface: a statistical manifold. Here, $\theta = (\theta_1, \theta_2, \ldots, \theta_n)$ are the parameters (like mean and variance for a Gaussian). Changing $\theta$ moves you to a nearby distribution. So, instead of Euclidean space with coordinates $x, y, z$, you have a space of probability models with coordinates $\theta_1, \theta_2, \ldots, \theta_n$.
The gradient of the log-likelihood with respect to the parameters, $\nabla_\theta \log p_\theta(x)$, also called the “score function”, is a vector in parameter space that gives the direction of steepest information gain and tells you how fast the log-likelihood changes per unit change in parameter.
“If the world looked like $x$, how should my beliefs about $\theta$ move?”
To update a probabilistic model from data, we must change parameters in the direction that reduces the surprise of observed outcomes.
This is equivalent to an infinitesimal form of Bayesian reasoning. Recall that Bayes’ rule of inference tells you how to update a full probability distribution after seeing data:
\[p(\theta \mid x) = \frac{p(x \mid \theta)\; p(\theta)}{p(x)}\]Bayesian updating says:
\[\log p(\theta \mid x) = \log p(\theta) + \log p(x \mid \theta) - \log p(x)\]Taking the gradient with respect to $\theta$:
\[\nabla_\theta \log p(\theta \mid x) = \nabla_\theta \log p(\theta) + \nabla_\theta \log p(x \mid \theta)\]The evidence term $\log p(x)$ does not depend on $\theta$, so its gradient vanishes. That’s the Bayesian gradient update:
The posterior update direction (how beliefs about $\theta$ should move) is given by the score function plus the gradient of the log-prior. If we assume a flat or uniform prior, so that \(\nabla_\theta \log p(\theta) = 0\), then the only term left is the score function.
So to reduce surprise, we move in the same direction that a Bayesian update with a flat prior would move our beliefs. Gradient ascent on log-likelihood (or descent on negative log likelihood) can be viewed as a Bayesian update in the limit of infinitesimal data and flat priors, or where the posterior distribution is collapsed to a point estimate. 4
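Here is a small sketch of that decomposition for an assumed toy model ($x \sim \mathcal{N}(\theta, 1)$ with a Gaussian prior on $\theta$): as the prior flattens, the posterior update direction collapses to the score, and the Gaussian prior contributes exactly the weight-decay term mentioned in footnote 4.

```python
# Assumed toy model: x ~ N(theta, 1) with prior theta ~ N(0, sigma2).
def score(theta, x):
    # gradient of log p(x | theta) = -(x - theta)^2 / 2 + const
    return x - theta

def log_prior_grad(theta, sigma2):
    # gradient of log p(theta) = -theta^2 / (2 * sigma2) + const
    return -theta / sigma2

theta, x = 0.5, 2.0
# Posterior update direction = prior gradient + score (the evidence has no theta in it).
print(log_prior_grad(theta, 10.0) + score(theta, x))   # 1.45: the score plus a small pull toward 0
print(log_prior_grad(theta, 1e12) + score(theta, x))   # ~1.50: flat prior -> pure score
# The prior's contribution, -theta / sigma2, is exactly an L2 weight-decay term.
```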
Learning as belief revision to reduce surprise
We could think of learning from experience as reducing surprise about what works.
This is the architecture of learning: the world generates events, those events surprise us by amounts determined by our current beliefs, and the gradient tells us exactly how to revise those beliefs to be less surprised next time.
The beliefs are encoded in parameters $\theta$. The data are observations from the world. The loss is information $(−log P)$. The gradient tells us how beliefs should move to make the world less surprising.
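That architecture fits in a few lines. The sketch below reuses the Bernoulli toy model from earlier; the data stream and learning rate are placeholders, not anything canonical.

```python
import math

def score(theta, x):                     # d/dtheta log p_theta(x)
    return 1 / theta if x == 1 else -1 / (1 - theta)

theta, lr = 0.5, 0.05                    # beliefs, encoded in a parameter
data = [1, 1, 0, 1, 1, 1, 0, 1]          # the world generates events

for x in data:
    # the loss is the information content of this event under current beliefs
    loss = -math.log(theta if x == 1 else 1 - theta)
    theta += lr * score(theta, x)        # move beliefs to be less surprised next time

print(round(theta, 3))                   # ~0.643: drifts from 0.5 toward the empirical rate 6/8
```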
Examples
Cross Entropy Loss
Cross-entropy loss, the standard loss for almost all modern neural networks (softmax classifiers, transformers), measures the model’s average surprise about the true outcomes. In classification, for example, we minimize the negative log likelihood of the correct class:
\[L = -\log p_\theta(y_{\text{true}} \mid x)\]Suppose a classifier predicts 90% cat, 10% dog, and the true label is “dog.” The model’s surprise is $-\log_2(0.1) \approx 3.3$ bits. If it had said 60–40, surprise drops to $-\log_2(0.4) \approx 1.3$ bits.
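In code, the same cat/dog example (the class probabilities are hypothetical model outputs):

```python
import math

# The cat/dog example above. The probabilities are the model's (hypothetical)
# predictions; the true label is "dog".
def surprise_bits(predicted_probs, true_class):
    return -math.log2(predicted_probs[true_class])

print(surprise_bits({"cat": 0.9, "dog": 0.1}, "dog"))   # ~3.32 bits
print(surprise_bits({"cat": 0.6, "dog": 0.4}, "dog"))   # ~1.32 bits: less surprised
```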
LLM
An LLM (say GPT) is trained to maximize next-token likelihood (or minimize negative log likelihood):
\[\min_\theta \sum_t - \log p_\theta(x_t \mid x_{<t})\]From information theory, the training loss is the cross-entropy between the data distribution $p^*(x)$ and the model $p_\theta(x)$:
\[L = -\mathbb{E}_{p^*} \left[ \log p_\theta(x) \right]\]Minimizing this is equivalent to minimizing the KL divergence: \(\mathrm{KL}\left(p^* \;\|\; p_\theta\right)\)
When GPT predicts the next token after “The sky is”, it assigns probabilities to “blue”, “falling”, “limit”, etc. If the training data says “blue”, and GPT had only given that a 20% chance, it feels about 2.3 bits of surprise ($-\log_2 0.2 \approx 2.32$). Training nudges its parameters so that next time, “blue” feels less surprising. Over billions of tokens, GPT gradually shapes a belief system about how the world (or language) behaves — minimizing its surprise one word at a time.
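A toy version of one such training signal (the next-token probabilities are invented for illustration, not GPT’s actual outputs):

```python
import math

# A few entries of a hypothetical next-token distribution after "The sky is".
next_token_probs = {"blue": 0.2, "falling": 0.05, "limit": 0.01, "clear": 0.1}

observed = "blue"                                  # what the training data says comes next
loss_nats = -math.log(next_token_probs[observed])  # per-token negative log-likelihood
print(loss_nats)                                   # ~1.61 nats
print(loss_nats / math.log(2))                     # ~2.32 bits -- the "2.3 bits" above
```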
RL
You often want to differentiate an expectation in reinforcement learning:
\[\nabla_{\theta} \mathbb{E}_{a \sim \pi_{\theta}(\cdot \mid s)}[f(a)]\]The gradient passes through the distribution $\pi_{\theta}$, so you get:
\[\nabla_{\theta} \mathbb{E}_{\pi_{\theta}} [f(a)] = \mathbb{E}_{\pi_{\theta}} \left[ f(a) \nabla_{\theta} \log \pi_{\theta}(a \mid s) \right]\]REINFORCE adjusts the policy by following this gradient of the log-probability, shifting probability mass toward actions that surprised us but worked well.
Example: A robot tries random actions to push a block. Most fail; one odd shove suddenly works. That action was both surprising and rewarding. REINFORCE says: make that action less surprising next time — increase its probability. The gradient of log-probability literally points in the direction of “repeat what surprisingly worked.”
RL turns rewarding surprises into habits of expectation.
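A minimal REINFORCE sketch on an invented 3-armed bandit, where the rewarding arm starts out unlikely under the policy:

```python
import math
import random

# Invented 3-armed bandit: arm 2 pays off, but the softmax policy starts uniform,
# so a success on arm 2 is both surprising and rewarding.
logits = [0.0, 0.0, 0.0]
rewards = [0.0, 0.0, 1.0]
lr = 0.1

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

for _ in range(2000):
    probs = softmax(logits)
    a = random.choices(range(3), weights=probs)[0]       # sample an action from the policy
    for i in range(3):
        # REINFORCE: reward times the gradient of log pi(a); for a softmax
        # policy that gradient w.r.t. the logits is one_hot(a) - probs.
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * rewards[a] * grad_log_pi

print([round(p, 2) for p in softmax(logits)])            # probability mass shifts onto arm 2
```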
Diffusion Models
In Diffusion Models, the training objective minimizes the KL divergence between the forward and reverse processes.
KL divergence is defined as: \(D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right]\)
Equivalently, \(D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p} [\log p(x) - \log q(x)]\)
This can also be rewritten as: \(D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p} [-\log q(x)] - \mathbb{E}_{x \sim p} [-\log p(x)]\)
KL divergence thus measures how much more surprised you are, on average, using model $q$ than you would be if you had perfect knowledge of the true distribution $p$.
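A small numeric check with made-up discrete distributions, showing that KL is exactly cross-entropy (expected surprise under the model) minus entropy (surprise with perfect knowledge):

```python
import math

# Made-up discrete distributions: p is the "true" world, q is the model.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

cross_entropy = sum(pi * -math.log(qi) for pi, qi in zip(p, q))  # expected surprise using q
entropy       = sum(pi * -math.log(pi) for pi in p)              # surprise knowing p exactly
kl            = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

print(kl)                        # ~0.184 nats
print(cross_entropy - entropy)   # the same number: KL is the extra surprise
```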
In diffusion models, the goal is to learn the reverse process $q_\theta$ that reconstructs data from noise. The training objective is: \(\min_\theta \; D_{\mathrm{KL}}(p_{\text{forward}} \,\|\, q_\theta)\)
This can be interpreted as: Adjust the model parameters so that, on average, the model is no more surprised by the data than it would be if it knew the true forward distribution.
KL divergence is the expected extra surprise due to an imperfect reverse model. Minimizing KL divergence means minimizing the model’s average surprise about what the forward process did, making the reverse prediction as unsurprising as possible.
Summary: There are many possible worlds. We reduce uncertainty when one world is revealed. Surprise, $-\log p(x)$, measures how unexpected that world was. Cross-entropy measures expected surprise. Learning in ML is about minimizing expected surprise. We do this with the help of score function gradients: a local rule for how to change beliefs, which is the differential form of Bayes’ rule when the beliefs are not a full distribution but just a point estimate.
This is the unifying framework behind almost all modern ML.
Based on a logic that deals with uncertainties, events, and actualities in becoming, ML and information theory orient us toward processes of worldly inference and information flow. This makes such systems capable of observing, acting, learning, and evolving in the real world in a way classical computers never could.
Cox asked: What if we generalize this to continuous degrees of belief, while preserving logical consistency (e.g., consistency with conjunction and negation rules)? He derived that any system obeying these logical consistency constraints must be isomorphic to probability theory — i.e., plausibility behaves like probability. ↩
This is equivalent to: how many bits (or yes/no questions) would suffice to convey a piece of information. Coin flip surprise $= 1$ bit. Die roll surprise $= \log_2(6) \approx 2.58$ bits. Together: coin + die outcome surprise should be $1 + 2.58 \approx 3.58$ bits. ↩
It makes information quantifiable. We can measure “bits of surprise” without touching meaning or context. That universality is why Shannon’s theory underlies everything from file compression to genetics. Taking logarithms also turns the multiplication of many tiny probabilities into addition, so that prediction and control stay tractable. ↩
Bayes’ rule updates beliefs after seeing data. This is a global operation: it recomputes the entire distribution at once. One doesn’t move continuously through parameter space; one replaces the old belief with a new one. Here priors express inductive biases explicitly; Bayesian models can “bake in” prior knowledge to perform well even with limited data. In modern ML, by contrast, parameters are treated not as random variables with a prior distribution but as deterministic learned weights, and priors are implicit: L2 weight decay indirectly imposes a Gaussian prior on weights by penalizing large weights, and early stopping implicitly imposes a prior favoring smoother solutions. ↩