Maximum Likelihood Estimation

09/17/2024

In Deep Learning, many common loss functions are presented as quantities to minimize, but they can be derived from Maximum Likelihood Estimation (MLE). In other words, updating a neural network's parameters \(\theta\) by minimizing such a loss is equivalent to maximizing the likelihood of the observed data under the model.

Suppose the dataset is \(\mathcal{D} = \{ (x_i,\, y_i) \}^n_{i=1} \), the observations are \(\text{i.i.d.}\), and the neural network defines a conditional distribution \(p_{\theta}(y \mid x) \). Under these assumptions, the likelihood is the product of the probabilities of the individual samples. If the probability \(p_{\theta}(y_i \mid x_i)\) of even one datum is small, the entire product shrinks. Therefore, the model must assign high probability to all data points simultaneously. \[ \begin{aligned} L(\theta; \mathcal{D}) &= \prod^n_{i=1} p_{\theta}(y_i \mid x_i) \\\\ \hat{\theta}_{\mathrm{MLE}} &= \arg \max_{\theta} L(\theta; \mathcal{D}) \end{aligned} \]
The likelihood function is a product, and since probabilities satisfy \(0 \leq p \leq 1\), multiplying many of them together can produce a value so small that it underflows to zero in floating-point arithmetic. Therefore, in Deep Learning, we take the \(\log\) of the likelihood for numerical stability, which converts the product into a summation. Because \(\log\) is monotonically increasing, the \(\theta\) that maximizes the log-likelihood also maximizes the likelihood: \[ \ell(\theta; \mathcal{D}) = \log L(\theta; \mathcal{D}) = \sum^n_{i=1} \log p_{\theta} (y_i \mid x_i) \]
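The underflow problem can be seen in a minimal sketch. The setup is hypothetical: 2000 samples, each assigned probability 0.5 by the model. The numbers are chosen only to make the effect visible in double precision.

```python
import math

# Hypothetical setup: 2000 samples, each given probability 0.5 by the model.
probs = [0.5] * 2000

# Direct product of probabilities: underflows to 0.0 in double precision.
likelihood = 1.0
for p in probs:
    likelihood *= p

# Sum of log-probabilities: remains a finite, well-scaled number.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0
print(log_likelihood)  # approximately -1386.29
```

Once the product has collapsed to zero it no longer distinguishes between parameter settings, whereas the log-likelihood still does.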
Maximizing the log-likelihood is the objective in MLE, but training models in Deep Learning is typically implemented as a minimization problem. To align the objectives, therefore, we minimize the negative log-likelihood (NLL). \[ \mathcal{L}(\theta; \mathcal{D}) = -\log L(\theta; \mathcal{D}) = -\sum^n_{i=1} \log p_{\theta} (y_i \mid x_i) \]
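As a small illustration of this equivalence, the sketch below compares two hypothetical models by the probabilities they assign to the same three observed labels. The function name `negative_log_likelihood` and the probability values are made up for the example.

```python
import math

def negative_log_likelihood(probs):
    """NLL given the per-sample probabilities p_theta(y_i | x_i)."""
    return -sum(math.log(p) for p in probs)

# Hypothetical probabilities that two models assign to the observed labels.
model_a = [0.9, 0.8, 0.7]
model_b = [0.6, 0.5, 0.4]

likelihood_a = math.prod(model_a)
likelihood_b = math.prod(model_b)
nll_a = negative_log_likelihood(model_a)
nll_b = negative_log_likelihood(model_b)

# The model with the higher likelihood is the one with the lower NLL.
print(likelihood_a > likelihood_b)  # True
print(nll_a < nll_b)                # True
```

Ranking models by likelihood (higher is better) and by NLL (lower is better) gives the same ordering, which is why the two objectives select the same \(\theta\).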
In practice, the form of the negative log-likelihood depends on the distribution we assume for the data. For example, if we assume a Gaussian distribution with fixed variance, minimizing the NLL becomes equivalent to minimizing the Mean Squared Error (MSE).

Let's consider a regression problem where the goal is to predict continuous \(y\) values. To apply MLE to this task, we first need to assume a probability distribution for \(y\). The likelihood is a function of the model parameters that quantifies how well the model explains the observed data, and it can only be evaluated once we have chosen a distribution that assigns a probability (or density) to each observation.

For continuous data, the probability of observing any single exact value is \(0\), so we instead consider the probability that \(y\) falls within a certain range. For a Gaussian distribution, this interval probability is obtained by integrating the probability density function (pdf), and the likelihood is expressed through this pdf. The pdf of a Gaussian distribution is defined below, where \(x\) is the value of the random variable, \(\mu\) is the mean of the distribution, and \(\sigma\) is the standard deviation (so \(\sigma^2\) is the variance): \[ pdf(x) = \frac{1}{\sigma\sqrt{2\pi}} \, \exp \left( - \frac{(x - \mu)^2}{2\sigma^2} \right) \]
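The difference between a density and a probability can be checked with a short sketch. Here `sigma` denotes the standard deviation (so `sigma ** 2` is the variance), and the helper names `gaussian_pdf` and `gaussian_cdf` as well as the chosen values are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian density at x; sigma is the standard deviation."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gaussian_cdf(x, mu, sigma):
    """P(X <= x), written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 0.0, 0.1

# A density is not a probability: for a narrow Gaussian it exceeds 1.
density_at_mean = gaussian_pdf(mu, mu, sigma)

# An interval probability is the integral of the pdf, i.e. a difference of CDF values.
prob_within_one_sigma = gaussian_cdf(mu + sigma, mu, sigma) - gaussian_cdf(mu - sigma, mu, sigma)

print(density_at_mean)        # approximately 3.989
print(prob_within_one_sigma)  # approximately 0.6827
```

The density value can be larger than 1, while the interval probability always stays between 0 and 1.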
If we assume that \(y\) is sampled from a normal distribution \(y \sim \mathcal{N}(f_{\theta}(x), \, \sigma^2) \), then the model output \(f_{\theta}(x)\) serves as the mean of that distribution. In this case, the likelihood for a datum can be written using the pdf as: \[ p_\theta(y \mid x) = \frac{1}{\sqrt{2\pi \sigma^2}} \, \exp \left( - \frac{(y - f_\theta(x))^2}{2\sigma^2} \right) \]
Expanding this expression by applying \(\log\) gives: \[ \begin{aligned} \log p_\theta(y \mid x) &= \log \left( \frac{1}{\sqrt{2\pi \sigma^2}} \, \exp \left( - \frac{(y - f_\theta(x))^2}{2\sigma^2} \right) \right) \\\\ &= \log \frac{1}{\sqrt{2\pi \sigma^2}} + \log \exp \left( - \frac{(y - f_{\theta}(x) )^2}{2\sigma^2} \right) \\\\ &= -\frac{1}{2} \log(2\pi\sigma^2) - \frac{ (y - f_{\theta}(x))^2 }{2\sigma^2} \end{aligned} \]
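The expansion can be sanity-checked numerically. In the sketch below, the observed value, the prediction standing in for \(f_\theta(x)\), and \(\sigma\) are arbitrary numbers chosen for illustration, and the function names are hypothetical.

```python
import math

def gaussian_pdf(y, mean, sigma):
    """Gaussian density of y given the predicted mean; sigma is the standard deviation."""
    return math.exp(-((y - mean) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gaussian_log_pdf(y, mean, sigma):
    """Closed-form log-density from the last line of the expansion."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2 * sigma ** 2)

# Arbitrary values: y is the observation, prediction stands in for f_theta(x).
y, prediction, sigma = 1.3, 0.9, 0.5

direct = math.log(gaussian_pdf(y, prediction, sigma))
expanded = gaussian_log_pdf(y, prediction, sigma)

print(abs(direct - expanded) < 1e-12)  # True
```

The closed form is also the safer one to compute: for very large residuals the density itself can underflow to zero before the logarithm is taken, while the expanded expression never evaluates the exponential.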
Now, substituting the probability density function into the log-likelihood \(\ell(\theta; \mathcal{D}) = \sum^n_{i=1} \log p_{\theta}(y_i \mid x_i) \) for all data samples, we obtain: \[ \begin{aligned} \ell (\theta; \mathcal{D}) &= \sum^n_{i=1} \left[ -\frac{1}{2} \log (2\pi \sigma^2) - \frac{ (y_i - f_{\theta}(x_i))^2 }{2\sigma^2} \right] \\\\ &= - n \cdot \frac{1}{2} \log (2\pi \sigma^2) - \frac{1}{2\sigma^2} \sum^n_{i=1} (y_i - f_{\theta}(x_i))^2 \end{aligned} \]
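Our original goal is to maximize the log-likelihood to obtain \(\hat{\theta}_{MLE}\). If \(\sigma\) is treated as a fixed constant, the first term \(- \frac{n}{2} \log (2\pi \sigma^2)\) does not depend on \(\theta\), and the factor \(\frac{1}{2\sigma^2}\) is a positive constant, so neither affects which \(\theta\) is optimal. Maximizing the log-likelihood is therefore the same as minimizing the sum of squared residuals, which matches the way neural network training is typically implemented as a minimization problem: \[ \hat{\theta}_{MLE} = \arg\min_\theta \sum_{i=1}^n (y_i - f_\theta(x_i))^2 \]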
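A minimal numerical check of this result is sketched below, under several assumptions that are not part of the derivation: a toy dataset, a one-parameter model \(f_\theta(x) = w x\), a fixed \(\sigma = 0.5\), and a simple grid search in place of gradient-based training.

```python
import math

# Hypothetical toy data, roughly y = 2x plus noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.1, 3.9, 6.2, 7.8]
sigma = 0.5  # assumed fixed, not learned

def sse(w):
    """Sum of squared errors for the one-parameter model f(x) = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(w):
    """Negative log-likelihood under y ~ N(w * x, sigma^2), summed per sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - w * x) ** 2 / (2 * sigma ** 2)
    return total

# Grid search over candidate weights from 1.00 to 3.00 in steps of 0.01.
candidates = [i / 100 for i in range(100, 301)]
w_sse = min(candidates, key=sse)
w_nll = min(candidates, key=gaussian_nll)

print(w_sse, w_nll)  # both objectives select the same weight
```

The two objectives have different values at every candidate, but they differ only by a constant offset and a positive scale, so they pick the same `w`.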
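In other words, applying MLE under the Gaussian assumption leads to minimizing the sum of squared errors (SSE). Dividing by the constant \(n\) does not change the minimizer, so minimizing SSE and minimizing MSE yield the same solution for \(\theta\): \[ \arg\min_\theta \sum_{i=1}^n (y_i - f_\theta(x_i))^2 \;=\; \arg\min_\theta \frac{1}{n}\sum_{i=1}^n (y_i - f_\theta(x_i))^2. \] Training with the MSE loss is therefore MLE under a Gaussian distribution with fixed variance.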
This approach of deriving the negative log-likelihood from a chosen distribution extends naturally to other cases: for example, the Bernoulli distribution leads to Binary Cross Entropy (BCE), the Categorical distribution leads to Cross Entropy (CE), and the Laplace distribution leads to Mean Absolute Error (MAE).
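The Bernoulli case can be checked in the same spirit. The sketch below assumes the model outputs \(p = p_\theta(y = 1 \mid x)\) and uses the Bernoulli probability mass function \(p^{y}(1 - p)^{1 - y}\). The predictions, labels, and function names are made up for illustration.

```python
import math

def bernoulli_nll(p, y):
    """Negative log of the Bernoulli pmf p^y * (1 - p)^(1 - y) for a label y in {0, 1}."""
    likelihood = (p ** y) * ((1 - p) ** (1 - y))
    return -math.log(likelihood)

def binary_cross_entropy(p, y):
    """BCE in its usual form for a single sample."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical predicted probabilities of the positive class and the true labels.
predictions = [0.9, 0.2, 0.7]
labels = [1, 0, 1]

nll = sum(bernoulli_nll(p, y) for p, y in zip(predictions, labels))
bce_sum = sum(binary_cross_entropy(p, y) for p, y in zip(predictions, labels))

print(abs(nll - bce_sum) < 1e-12)  # True
```

Taking the log of the Bernoulli pmf turns the exponents \(y\) and \(1 - y\) into the two weighted terms of BCE, which is why the two sums agree up to rounding.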

References
- https://www.youtube.com/watch?v=M6Hf6R8byvM
- https://angeloyeo.github.io/2020/07/17/MLE.html