
Maximum Likelihood Estimation


So far we've had two ideas for building an estimator for a statistical functional T: one is to plug \widehat{\nu} into T, and the other—kernel density estimation—is closely related (we just smear the probability mass out around each observed data point before substituting into T). In this section, we'll learn another approach which has some compelling properties and is suitable for choosing from a parametric family of densities or mass functions.

Let's revisit the example from the first section where we looked for the Gaussian distribution which best fits a given set of measurements of the heights of 50 adults. This time, we'll include a goodness score for each choice of \mu and \sigma^2, so we don't have to select a best fit subjectively.

The goodness function we'll use is called the log likelihood function, which we define to be the log of the product of the density function evaluated at each of the observed data points. This function rewards density functions which have larger values at the observed data points and penalizes functions which have very small values at some of the points. This is a rigorous way of capturing the idea that a given density function is consonant with the observed data.

Adjust the knobs to get the goodness score as high as possible (hint: you can get it up to about -135.8).


The best μ value is , and the best σ value is .

Definitions

Consider a parametric family \{f_{\boldsymbol{\theta}}(x) : \boldsymbol{\theta} \in \mathbb{R}^d\} of PDFs or PMFs. For example, the parametric family might consist of all Gaussian distributions, all geometric distributions, or all discrete distributions on a particular finite set.

Given \mathbf{x} \in \mathbb{R}^n, the likelihood \mathcal{L}_{\mathbf{x}}: \mathbb{R}^d \to \mathbb{R} is defined by

\begin{align*}\mathcal{L}_{\mathbf{x}}(\boldsymbol{\theta}) = f_{\boldsymbol{\theta}}(x_{1})f_{\boldsymbol{\theta}}(x_{2})\cdots f_{\boldsymbol{\theta}}(x_{n}).\end{align*}

The idea is that if \mathbf{X} is a vector of n independent observations drawn from f_{\boldsymbol{\theta}}(x), then \mathcal{L}_{\mathbf{X}}(\boldsymbol{\theta}) is small or zero when \boldsymbol{\theta} is not in concert with the observed data.

Because the likelihood is defined to be a product of many factors, its values are often extremely small, and we may encounter numerical underflow issues. Furthermore, sums are often easier to reason about than products. For both of these reasons, we often compute the logarithm of the likelihood instead:

\begin{align*}\log(\mathcal{L}_{\mathbf{x}}(\boldsymbol{\theta}) )= \log(f_{\boldsymbol{\theta}}(x_{1})) + \log(f_{\boldsymbol{\theta}}(x_{2})) + \cdots + \log(f_{\boldsymbol{\theta}}(x_{n})).\end{align*}

Maximizing the likelihood is the same as maximizing the log likelihood because the natural logarithm is a monotonically increasing function.
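As a quick illustration of the underflow point, here is a minimal sketch (assuming NumPy and a made-up Gaussian sample; the parameters, seed, and sample size are chosen only for demonstration): the raw product of thousands of density values underflows to zero in double precision, while the sum of the log densities is an unremarkable number.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170, scale=10, size=10_000)   # hypothetical sample of 10,000 heights

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# The product of 10,000 density values underflows to 0.0 in double precision ...
likelihood = np.prod(gaussian_pdf(x, 170, 10))

# ... while the sum of the log densities is an ordinary negative number.
log_likelihood = np.sum(np.log(gaussian_pdf(x, 170, 10)))

print(likelihood)       # 0.0
print(log_likelihood)   # roughly -3.7e4 for this sample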

Example
Suppose x\mapsto f(x;\theta) is the density of a uniform random variable on [0,\theta]. We observe four samples drawn from this distribution: 1.41, 2.45, 6.12, and 4.9. Find \mathcal{L}(5), \mathcal{L}(10^6), and \mathcal{L}(7).

Solution. The likelihood at 5 is zero, since f_{5}(x_{3}) = 0. The likelihood at 10^6 is very small, since \mathcal{L}(10^6) = (1/10^6)^4 = 10^{-24}. The likelihood at 7 is larger: (1/7)^4 = 1/2401.
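For concreteness, here is a minimal sketch of this computation (assuming NumPy; the helper function uniform_likelihood is not part of the original text):

import numpy as np

data = np.array([1.41, 2.45, 6.12, 4.9])

def uniform_likelihood(theta, data):
    # Likelihood of Uniform(0, theta): (1/theta)^n if every point lies in [0, theta], else 0
    if np.all((data >= 0) & (data <= theta)):
        return (1 / theta) ** len(data)
    return 0.0

for theta in [5, 1e6, 7]:
    print(theta, uniform_likelihood(theta, data))
# 5    -> 0.0             (6.12 lies outside [0, 5])
# 1e6  -> 1e-24
# 7    -> about 0.000416  (= 1/2401)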

As illustrated in this example, likelihood has the property of being zero or small at implausible values of \boldsymbol{\theta}, and larger at more reasonable values. Thus we propose the maximum likelihood estimator

\begin{align*}\widehat{\boldsymbol{\theta}}_{\mathrm{MLE}} = \operatorname{argmax}_{\boldsymbol{\theta} \in \mathbb{R}^d}\mathcal{L}_{\mathbf{X}}(\boldsymbol{\theta}).\end{align*}

Example
Suppose that x\mapsto f(x;\mu,\sigma^2) is the normal density with mean \mu and variance \sigma^2. Find the maximum likelihood estimator for \mu and \sigma^2.

Solution. The maximum likelihood estimator is the maximizer of the logarithm of the likelihood function, which works out to

\begin{align*}- \frac{n}{2}\log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{(X_1-\mu)^2}{2\sigma^2} - \cdots - \frac{(X_n - \mu)^2}{2\sigma^2},\end{align*}

since \log f(X_i; \mu, \sigma^2) = \log\left(\frac{1}{\sigma \sqrt{2\pi}}\operatorname{e}^{-(X_i-\mu)^2/(2\sigma^2)}\right) = -\frac{\log 2\pi}{2} - \log \sigma - \frac{1}{2\sigma^2}(X_i - \mu)^2, for each i.

Setting the derivatives with respect to \mu and v = \sigma^2 equal to zero, we find

\begin{align*}\frac{\partial \log \mathcal{L}_\mathbf{X}(\boldsymbol{\theta})}{\partial v} &= -\frac{n}{2v} + \frac{1}{2v^2}\sum_{i=1}^n(X_i-\mu)^2 = 0 \\ \frac{\partial \log \mathcal{L}_\mathbf{X}(\boldsymbol{\theta})}{\partial \mu}&= \frac{1}{v}\sum_{i=1}^n(X_i - \mu) = 0,\end{align*}

which implies \mu = \overline{X} = \frac{1}{n}(X_1+\cdots+X_n) (from solving the second equation) as well as v = \sigma^2 = \frac{1}{n}((X_1-\overline{X})^2 + \cdots + (X_n-\overline{X})^2) (from substituting \mu = \overline{X} and solving the first equation). Since there's only one critical point, and since the log likelihood tends to -\infty as (\mu, \sigma^2) approaches the boundary of the parameter space, this critical point must be the global maximum.

So we may conclude that the maximum likelihood estimator agrees with the plug-in estimator for \mu and \sigma^2.
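As a sanity check, here is a minimal numerical sketch (assuming NumPy and SciPy, with a made-up sample; none of this appears in the original text) comparing a general-purpose maximization of the log likelihood with the plug-in formulas derived above:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=500)    # hypothetical sample

def negative_log_likelihood(params):
    mu, log_sigma = params                      # optimize log(sigma) so that sigma stays positive
    sigma = np.exp(log_sigma)
    return np.sum(np.log(sigma * np.sqrt(2 * np.pi)) + (x - mu)**2 / (2 * sigma**2))

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma2_hat = result.x[0], np.exp(result.x[1])**2

print(mu_hat, sigma2_hat)    # numerical maximum likelihood estimates
print(x.mean(), x.var())     # plug-in estimates; they agree up to optimizer tolerance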

Exercise
Consider a Poisson random variable X with parameter \lambda. In other words, \mathbb{P}(X = x) = \frac{\lambda^x \operatorname{e}^{-\lambda}}{x!}.

Verify that

\begin{align*}\log \left(\mathcal{L}_{\mathbf{X}}(\lambda)\right) = \log(\lambda) \sum_{i = 1}^n X_i - n\lambda - \sum_{i = 1}^n \log(X_i!).\end{align*}

Show that it follows that the maximum likelihood estimator \widehat{\lambda} is equal to the sample mean \bar{X}, and explain why this makes sense intuitively.

Solution. When we take the derivative with respect to \lambda and set it equal to zero, we get

\begin{align*}\frac{\sum_{i = 1}^n X_i }{\widehat{\lambda}} - n = 0 ,\end{align*}

which gives us \widehat{\lambda} = \frac{\sum X_i}{n} = \bar{\mathbf{X}}, the sample mean.

Taking a second derivative with respect to \lambda gives -\frac{\sum_{i = 1}^n X_i }{\lambda^2}. Since this quantity is everywhere negative, the log likelihood is concave. Therefore, the log likelihood has a local maximum at the critical point \widehat{\lambda}, and that local maximum is also a global maximum.
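Here is a minimal numerical check of this exercise (assuming NumPy; the sample, the rate 4.2, and the grid of \lambda values are made up for illustration):

import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(lam=4.2, size=1000)            # hypothetical Poisson sample

# Evaluate the log likelihood on a grid of lambda values (the factorial term is
# constant in lambda, so it can be dropped without changing the maximizer).
lambdas = np.linspace(0.1, 10, 2000)
log_lik = np.log(lambdas) * x.sum() - len(x) * lambdas

print(lambdas[np.argmax(log_lik)])   # grid maximizer
print(x.mean())                      # sample mean: agrees up to grid resolution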

Example
Suppose Y_i = X_i\beta + \epsilon_i for i = 1, 2, \ldots, n, where the \epsilon_i are independent with distribution \mathcal{N}(0, \sigma^2). Treat \sigma as known and \beta as the only unknown parameter. Suppose that n observations (X_1, Y_1), \ldots, (X_n, Y_n) are made.

Show that the least squares estimator for \beta is the same as the MLE for \beta by making observations about your log likelihood.

Solution. The log likelihood is

\begin{align*}\log \left(\mathcal{L}_{\mathbf{X}}(\beta)\right) = \sum_{i = 1}^n\left[ \log\left(\frac{1}{\sqrt{2\pi \sigma^2}}\right) - \frac{(Y_i - X_i\beta)^2}{2\sigma^2}\right].\end{align*}

The only term that depends on \beta is the second one, so maximizing the log likelihood is the same as maximizing - \sum_{i=1}^n \frac{(Y_i - X_i\beta)^2}{2\sigma^2}, which in turn is the same as minimizing \sum_{i=1}^n(Y_i - X_i\beta)^2.
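The following is a small sketch (assuming NumPy, a made-up no-intercept model Y_i = X_i\beta + \epsilon_i, and a simple grid search; none of this comes from the original text) confirming that maximizing this log likelihood and minimizing the sum of squared residuals land on the same \beta:

import numpy as np

rng = np.random.default_rng(3)
n = 200
X = rng.uniform(0, 10, size=n)
beta_true, sigma = 2.5, 1.0
Y = beta_true * X + rng.normal(0, sigma, size=n)   # hypothetical data from the model

# Log likelihood as a function of beta (sigma treated as known); only the
# squared-residual term depends on beta.
def log_likelihood(beta):
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (Y - X * beta)**2 / (2 * sigma**2))

betas = np.linspace(2.0, 3.0, 2001)
beta_mle = betas[np.argmax([log_likelihood(b) for b in betas])]

beta_ls = np.sum(X * Y) / np.sum(X**2)   # least squares solution for the no-intercept model

print(beta_mle, beta_ls)                 # agree up to grid resolution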

Exercise
(a) Consider the family of distributions which are uniform on [0,b], where b \in (0,\infty). Explain why the MLE for the distribution maximum b is the sample maximum.

(b) Show that the MLE for a Bernoulli distribution with parameter p is the empirical success rate \frac{1}{n} \sum_{i=1}^n X_i.

Solution. (a) The likelihood associated with any value of b smaller than the sample maximum is zero, since at least one of the density values is zero in that case. The likelihood is a decreasing function of b as b ranges from the sample maximum to \infty, since it's equal to (1/b)^n. Therefore, the maximal value is at the sample maximum.

(b) The derivative of the log likelihood function is

\begin{align*}\frac{\operatorname{d}}{\operatorname{d} p}\log \left(p^s(1-p)^{n-s}\right) = \frac{s}{p} - \frac{n-s}{1-p},\end{align*}

where s is the number of successes. Setting the derivative equal to zero and solving for p, we find p = s/n.
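As a quick check of part (b), here is a minimal sketch (assuming NumPy; the success probability 0.3 and the grid are made up) comparing a grid maximization of the Bernoulli log likelihood with the empirical success rate:

import numpy as np

rng = np.random.default_rng(4)
x = rng.binomial(1, 0.3, size=500)     # hypothetical Bernoulli(0.3) sample
s, n = x.sum(), len(x)

p_grid = np.linspace(0.001, 0.999, 999)
log_lik = s * np.log(p_grid) + (n - s) * np.log(1 - p_grid)

print(p_grid[np.argmax(log_lik)])      # grid maximizer
print(s / n)                           # empirical success rate: agrees up to grid resolution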

Properties of the Maximum Likelihood Estimator

MLE enjoys several nice properties: under certain regularity conditions, we have

  • Consistency: \mathbb{E}[(\widehat{\theta}_{\mathrm{MLE}} - \theta)^2] \to 0 as the number of samples goes to \infty. In other words, the average squared difference between the maximum likelihood estimator and the parameter it's estimating converges to zero.

  • Asymptotic normality: (\widehat{\theta}_{\mathrm{MLE}} - \theta)/\sqrt{\operatorname{Var} \widehat{\theta}_{\mathrm{MLE}}} converges to \mathcal{N}(0,1) as the number of samples goes to \infty. This means that we can calculate good confidence intervals for the maximum likelihood estimator, assuming we can accurately approximate its mean and variance (see the simulation sketch after this list).

  • Asymptotic optimality: the MSE of the MLE converges to 0 approximately as fast as the MSE of any other consistent estimator. Thus the MLE is not wasteful in its use of data to produce an estimate.

  • Equivariance: if \widehat{\theta} is the MLE of \theta for the family f_{\theta}, then the MLE of g(\theta) is g(\widehat{\theta}). This is a useful property: it says that transforming the parameter of interest (say, shifting the mean of a normal distribution, or squaring the standard deviation) poses no inconvenience, because we can simply apply the same transformation to the MLE.
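Here is a minimal simulation sketch of the asymptotic normality property (assuming NumPy and the Poisson model from the earlier exercise, where \widehat{\lambda} is the sample mean; the rate, sample size, and replication count are made up, and we standardize by the true variance \lambda/n rather than an estimate, for simplicity):

import numpy as np

rng = np.random.default_rng(5)
lam, n, reps = 4.2, 400, 10_000

# The Poisson MLE is the sample mean; simulate its sampling distribution.
lam_hat = rng.poisson(lam, size=(reps, n)).mean(axis=1)

# Standardize by the true standard deviation of the sample mean, sqrt(lam / n).
z = (lam_hat - lam) / np.sqrt(lam / n)

print(z.mean(), z.std())            # roughly 0 and 1
print(np.mean(np.abs(z) < 1.96))    # roughly 0.95, as for a standard normal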

Example
Show that the plug-in variance estimator for a sequence of n i.i.d. samples from a Gaussian distribution \mathcal{N}(\mu, \sigma^2) converges to \sigma^2 as n\to\infty.

Solution. We've seen that the plug-in variance estimator is the maximum likelihood estimator for variance. Therefore, it converges to \sigma^2 by MLE consistency.

Exercise
Show that it is not possible to estimate the mean of a distribution in a way that converges to the true mean at a rate asymptotically faster than 1/\sqrt{n}, where n is the number of observations.

Solution. The sample mean is the maximum likelihood estimator, and it converges to the mean at a rate proportional to the inverse square root of the number of observations. Therefore, by the asymptotic optimality of the MLE, no other consistent estimator can converge at an asymptotically faster rate.
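To see the 1/\sqrt{n} rate empirically, here is a minimal sketch (assuming NumPy and a made-up Exponential(1) population with mean 1; the sample sizes and replication count are arbitrary) in which the root-mean-square error of the sample mean shrinks like 1/\sqrt{n}, so that multiplying it by \sqrt{n} gives a roughly constant value:

import numpy as np

rng = np.random.default_rng(6)

for n in [100, 400, 1600, 6400]:
    # 2,000 replications of the sample mean of n Exponential(1) observations
    means = rng.exponential(scale=1.0, size=(2_000, n)).mean(axis=1)
    rmse = np.sqrt(np.mean((means - 1.0) ** 2))
    print(n, rmse, np.sqrt(n) * rmse)   # the last column stays near 1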

Drawbacks of maximum likelihood estimation

The maximum likelihood estimator is not a panacea. We've already seen that the maximum likelihood estimator can be biased (the sample maximum for the family of uniform distributions on [0,b], where b \in (0,\infty), is always less than or equal to the true value of b). There are several other issues that can arise when maximizing likelihoods.

  • Computational difficulties. It might be difficult to work out where the maximum of the likelihood occurs, either analytically or numerically. This would be a particular concern in high dimensions (that is, if we have many parameters) and if the likelihood function is not concave.
  • Misspecification. The MLE may be inaccurate if the distribution of the observations is not in the specified parametric family. For example, if we assume the underlying distribution is Gaussian, when in fact its shape is not even close to that of a Gaussian, we very well might get unreasonable results.
  • Unbounded likelihood. If the likelihood function is not bounded, then \widehat{\theta}_{\mathrm{MLE}} is not even defined:

Exercise
Consider the family of distributions on \mathbb{R} given by the set of density functions

\begin{align*}\gamma \mathbf{1}_{[a,b]} + \delta \mathbf{1}_{[c,d]},\end{align*}

where a < b < c < d, and where \gamma and \delta are nonnegative real numbers such that \gamma(b-a) + \delta(d-c) = 1. Show that the likelihood function has no maximum for this family of functions.


Solution. We identify the largest value in our data set and choose c to be \epsilon less than that value and d to be \epsilon more than it. We choose a and b so that the interval [a,b] contains all of the other observations (since otherwise we would get a likelihood value of zero). Then we can send \epsilon to zero while holding a,b and \gamma fixed. That sends \delta to \infty, which in turn causes the likelihood to grow without bound.
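The following is a minimal numerical sketch of this argument (assuming NumPy and a made-up data set; the specific values of a, b, and \gamma are arbitrary choices that keep all but the largest observation inside [a,b]):

import numpy as np

data = np.array([0.8, 1.3, 2.1, 2.6, 4.0])   # hypothetical data; 4.0 is the largest observation
a, b, gamma = 0.0, 3.0, 0.2                  # [a, b] covers every observation except the largest

def likelihood(eps):
    c, d = 4.0 - eps, 4.0 + eps
    delta = (1 - gamma * (b - a)) / (d - c)  # forced by the constraint gamma*(b-a) + delta*(d-c) = 1
    density = gamma * ((data >= a) & (data <= b)) + delta * ((data >= c) & (data <= d))
    return np.prod(density)

for eps in [0.5, 0.1, 0.01, 0.001]:
    print(eps, likelihood(eps))              # the likelihood grows without bound as eps -> 0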

One further disadvantage of the maximum likelihood estimator is that it doesn't provide a smooth mechanism for accounting for prior knowledge. For example, if we flip a coin twice and see heads both times, the maximum likelihood estimate of the heads probability is 100%, while our real-world belief would remain that it's about 50%. Only once we saw quite a few heads in a row would we begin to use that as evidence to move the needle on our strong prior belief that coins encountered in daily life are not heavily weighted to one side or the other.

Bayesian statistics provides an alternative framework which addresses this shortcoming of maximum likelihood estimation.
