Maximum likelihood

Maximum likelihood estimation (MLE) is a popular statistical method used to make inferences about parameters of the underlying probability distribution of a given data set.

The method was pioneered by geneticist and statistician Sir Ronald A. Fisher between 1912 and 1922 (see external resources below for more information on the history of MLE).

Prerequisites
The following discussion assumes that the reader is familiar with basic notions in probability theory such as probability distributions, probability density functions, random variables and expectation. It also assumes familiarity with standard techniques for maximising continuous real-valued functions, such as using differentiation to find a function's maxima.

The philosophy of MLE
Given a probability distribution $$D$$ with a known probability density function (continuous distribution) or a known probability mass function (discrete distribution), denoted $$f_D$$, and a distributional parameter $$\theta$$, we may draw a sample $$X_1, X_2, \ldots, X_n$$ of $$n$$ values from this distribution and then use $$f_D$$ to compute the probability (or, in the continuous case, the probability density) associated with our observed data $$x_1, x_2, \ldots, x_n$$:


 * $$\mathbb{P}(x_1,x_2,\dots,x_n) = f_D(x_1,\dots,x_n \mid \theta)$$

However, it may be that we don't know the value of the parameter $$\theta$$ despite knowing (or believing) that our data come from the distribution $$D$$. How should we estimate $$\theta$$? It is a sensible idea to draw a sample of $$n$$ values $$X_1, X_2, \ldots, X_n$$ and use these data to help us make an estimate.

Once we have our sample $$X_1, X_2, \ldots, X_n$$, we may seek an estimate of the value of $$\theta$$ from that sample. MLE chooses the value of $$\theta$$ under which the observed data are most probable (i.e., we maximise the likelihood of the observed data set over all possible values of $$\theta$$). This is in contrast to seeking other estimators, such as an unbiased estimator of $$\theta$$, which may not yield the value that maximises the likelihood but which will, on average, neither over-estimate nor under-estimate the true value of $$\theta$$.

To implement the MLE method mathematically, we define the likelihood:


 * $$\mbox{lik}(\theta) = f_D(x_1,\dots,x_n \mid \theta)$$

and maximise this function over all possible values of the parameter $$\theta$$. The value $$\hat{\theta}$$ which maximises the likelihood is known as the maximum likelihood estimator (MLE) for $$\theta$$.

Discrete distribution, discrete and finite parameter space
Consider tossing an unfair coin 80 times (i.e., we sample something like $$x_1=\mbox{H}$$, $$x_2=\mbox{T}$$, $$\ldots$$, $$x_{80}=\mbox{T}$$ and count the number of HEADS $$\mbox{H}$$ observed). Call the probability of tossing HEADS $$p$$, and the probability of tossing TAILS $$1-p$$ (so here $$p$$ is the parameter which we referred to as $$\theta$$ above). Suppose we toss 49 HEADS and 31 TAILS, and suppose the coin was taken from a box containing three coins: one which gives HEADS with probability $$p=1/3$$, one which gives HEADS with probability $$p=1/2$$ and one which gives HEADS with probability $$p=2/3$$. The coins have lost their labels, so we don't know which one it was. Using maximum likelihood estimation we can determine which coin makes the observed data most probable. The likelihood function (defined above) takes one of three values:



$$\begin{matrix} \mathbb{P}(\mbox{we toss 49 HEADS out of 80}\mid p=1/3) & = & \binom{80}{49}(1/3)^{49}(1-1/3)^{31} = 0.000 \\ &&\\ \mathbb{P}(\mbox{we toss 49 HEADS out of 80}\mid p=1/2) & = & \binom{80}{49}(1/2)^{49}(1-1/2)^{31} = 0.012 \\ &&\\ \mathbb{P}(\mbox{we toss 49 HEADS out of 80}\mid p=2/3) & = & \binom{80}{49}(2/3)^{49}(1-2/3)^{31} = 0.054 \\ \end{matrix} $$

We see that the likelihood is maximised by parameter $$\hat{p}=2/3$$, and so this is our maximum likelihood estimate for $$p$$.
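The three-coin comparison can be reproduced numerically. The following is a minimal sketch rather than a definitive implementation; the function name `binomial_likelihood` and the variable names are illustrative choices, not part of the discussion above.

```python
from math import comb

def binomial_likelihood(p, heads=49, tosses=80):
    """Probability of seeing `heads` HEADS in `tosses` tosses when P(HEADS) = p."""
    return comb(tosses, heads) * p**heads * (1 - p)**(tosses - heads)

# The three unlabelled coins in the box.
candidates = [1/3, 1/2, 2/3]

# Evaluate the likelihood at each candidate and keep the largest.
likelihoods = {p: binomial_likelihood(p) for p in candidates}
p_hat = max(likelihoods, key=likelihoods.get)

for p, lik in likelihoods.items():
    print(f"p = {p:.4f}: likelihood = {lik:.6f}")
print("maximum likelihood estimate:", p_hat)
```

Because the parameter space is finite, the maximisation is simply a comparison of three numbers.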

Discrete distribution, continuous parameter space
Now suppose our special box of coins from the previous example contains an infinite number of coins: one for every possible value $$0\leq p \leq 1$$. We must maximise the likelihood function:



$$\begin{matrix} \mbox{lik}(p) & = & f_D(\mbox{observe 49 HEADS out of 80}\mid p) = \binom{80}{49} p^{49}(1-p)^{31} \\ \end{matrix} $$

over all possible values $$0\leq p \leq 1$$.

One may maximise this function by differentiating with respect to $$p$$ and setting to zero:



$$\begin{matrix} 0 & = & \frac{d}{dp} \left( \binom{80}{49} p^{49}(1-p)^{31} \right) \\ &  & \\  & \propto & 49p^{48}(1-p)^{31} - 31p^{49}(1-p)^{30} \\ &  & \\  & = & p^{48}(1-p)^{30}\left[ 49(1-p) - 31p \right] \\ \end{matrix} $$

which has solutions $$p=0$$, $$p=1$$, and $$p=49/80$$. The solution which maximises the likelihood is clearly $$p=49/80$$ (since $$p=0$$ and $$p=1$$ result in a likelihood of zero). Thus we say the maximum likelihood estimate for $$p$$ is $$\hat{p}=49/80$$.

This result is easily generalised by substituting a letter such as $$t$$ in the place of 49 to represent the observed number of 'successes' of our Bernoulli trials, and a letter such as $$n$$ in the place of 80 to represent the number of Bernoulli trials. Exactly the same calculation yields the maximum likelihood estimator:


 * $$\hat{p}=\frac{t}{n}$$

for any sequence of $$n$$ Bernoulli trials resulting in $$t$$ 'successes'.
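As a numerical cross-check of the closed form $$\hat{p}=t/n$$, one can evaluate the likelihood on a fine grid of values of $$p$$ and pick the largest. This is only a sketch: the grid resolution of 1/10000 and the names `likelihood`, `grid` and `p_by_search` are assumptions made for illustration.

```python
from math import comb

def likelihood(p, t=49, n=80):
    """Likelihood of t successes in n Bernoulli trials with success probability p."""
    return comb(n, t) * p**t * (1 - p)**(n - t)

# Hypothetical grid over [0, 1]; a finer grid gives a finer answer.
grid = [k / 10000 for k in range(0, 10001)]

# Brute-force maximisation over the grid.
p_by_search = max(grid, key=likelihood)

# Closed-form estimator t / n from the derivation.
p_closed_form = 49 / 80

print(p_by_search, p_closed_form)
```

A grid search of this kind is rarely needed when a closed form exists, but it is a useful sanity check on the calculus.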

Continuous distribution, continuous parameter space
One of the most common continuous probability distributions is the normal distribution, which has probability density function:


 * $$f(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The corresponding density function for a sample of $$n$$ independent identically distributed normal random variables is:


 * $$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^\frac{n}{2} e^{-\frac{ \sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}}$$

or more conveniently:


 * $$f(x_1,\ldots,x_n \mid \mu,\sigma^2) = \left( \frac{1}{2\pi\sigma^2} \right)^\frac{n}{2} e^{-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}}$$

This distribution has two parameters: $$\mu,\sigma^2$$. This may be alarming to some, given that in the discussion above we only talked about maximising over a single parameter. However there is no need for alarm: we simply maximise the likelihood $$\mbox{lik}(\mu,\sigma) = f(x_1,\ldots,x_n \mid \mu, \sigma^2)$$ over both parameters, setting the partial derivative with respect to each to zero in turn, which is more work but no more complicated. In the above notation we would write $$\theta=(\mu,\sigma^2)$$.

When maximising the likelihood, we may equivalently maximise the log of the likelihood, since log is a continuous, strictly increasing function over the range of the likelihood. (Note: the log-likelihood is closely related to information entropy and Fisher information.) This often simplifies the algebra somewhat, and indeed does so in this case:



$$\begin{matrix} 0 & = & \frac{\partial}{\partial \mu} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^\frac{n}{2} e^{-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}} \right) \\ & = & \frac{\partial}{\partial \mu} \left( \log\left( \frac{1}{2\pi\sigma^2} \right)^\frac{n}{2} - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right) \\ & = & 0 - \frac{-2n(\bar{x}-\mu)}{2\sigma^2} \\ \end{matrix} $$

which is solved by $$\hat{\mu} = \bar{x} = \sum^{n}_{i=1}x_i/n $$. This is indeed the maximum of the function since it is the only turning point in $$\mu$$ and the second derivative is strictly less than zero.

Similarly we differentiate the log-likelihood with respect to $$\sigma$$ and equate to zero:



$$\begin{matrix} 0 & = & \frac{\partial}{\partial \sigma} \log \left( \left( \frac{1}{2\pi\sigma^2} \right)^\frac{n}{2} e^{-\frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2}} \right) \\ & = & \frac{\partial}{\partial \sigma} \left( \frac{n}{2}\log\left( \frac{1}{2\pi\sigma^2} \right) - \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{2\sigma^2} \right) \\ & = & -\frac{n}{\sigma} + \frac{ \sum_{i=1}^{n}(x_i-\bar{x})^2+n(\bar{x}-\mu)^2}{\sigma^3} \\ \end{matrix} $$

which, after substituting $$\mu=\hat{\mu}=\bar{x}$$ (so that the term $$n(\bar{x}-\mu)^2$$ vanishes), is solved by $$\hat{\sigma}^2 = \sum_{i=1}^n(x_i-\hat{\mu})^2/n$$.

Formally we say that the maximum likelihood estimator for $$\theta=(\mu,\sigma^2)$$ is:


 * $$\hat{\theta}=(\hat{\mu},\hat{\sigma}^2) = (\bar{x},\sum_{i=1}^n(x_i-\bar{x})^2/n)$$.
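The closed-form estimators for the normal distribution can be sketched in a few lines. The sample values below are made up purely for illustration, and the helper names `normal_mle` and `normal_log_likelihood` are hypothetical. Note that the variance estimate divides by $$n$$, not $$n-1$$.

```python
from math import log, pi

def normal_mle(xs):
    """Return (mu_hat, var_hat), the maximum likelihood estimates for a normal sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu_hat, var_hat

def normal_log_likelihood(xs, mu, var):
    """Log of the joint density of an i.i.d. normal sample with mean mu and variance var."""
    n = len(xs)
    return -0.5 * n * log(2 * pi * var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

sample = [2.1, 1.9, 3.2, 2.8, 2.5]  # made-up data, for illustration only
mu_hat, var_hat = normal_mle(sample)
best = normal_log_likelihood(sample, mu_hat, var_hat)

# Nudging either parameter away from the estimate should lower the log-likelihood.
nudges = [(0.05, 0.0), (-0.05, 0.0), (0.0, 0.05), (0.0, -0.05)]
nudged = [normal_log_likelihood(sample, mu_hat + dm, var_hat + dv) for dm, dv in nudges]

print(mu_hat, var_hat, best, max(nudged))
```

The perturbation check is not a proof of optimality, only a quick indication that the estimates sit at a local maximum of the log-likelihood.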

Functional invariance
If $$\widehat{\theta}$$ is the maximum likelihood estimator (MLE) for $$\theta$$, then the MLE for $$\alpha = g(\theta)$$ is $$\widehat{\alpha} = g(\widehat{\theta})$$. The function $$g$$ need not be one-to-one. For details, see the proof of Theorem 7.2.10 of Statistical Inference by George Casella and Roger L. Berger.
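Functional invariance can be illustrated with the coin data used earlier (49 HEADS in 80 tosses). The sketch below takes $$g(p) = p/(1-p)$$, the odds of HEADS, which is a choice made here for illustration. It compares $$g(\hat{p})$$ with a direct grid search over the odds; the grid step of 0.001 and the upper limit of 5 are arbitrary assumptions.

```python
from math import comb

def likelihood_p(p, t=49, n=80):
    """Likelihood of t HEADS in n tosses, as a function of p."""
    return comb(n, t) * p**t * (1 - p)**(n - t)

def likelihood_odds(odds, t=49, n=80):
    """The same likelihood, re-expressed through p = odds / (1 + odds)."""
    return likelihood_p(odds / (1 + odds), t, n)

# Route 1: apply g to the MLE of p.
p_hat = 49 / 80
odds_by_invariance = p_hat / (1 - p_hat)

# Route 2: maximise the likelihood directly over a hypothetical grid of odds values.
step = 0.001
odds_grid = [k * step for k in range(1, 5000)]
odds_by_search = max(odds_grid, key=likelihood_odds)

print(odds_by_invariance, odds_by_search)
```

The two routes agree up to the resolution of the grid, as the invariance property predicts.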

Asymptotic behaviour
Under suitable regularity conditions, maximum likelihood estimators achieve minimum variance (as given by the Cramér-Rao lower bound) in the limit as the sample size tends to infinity. When the MLE is unbiased, we may equivalently say that it has minimum mean squared error in the limit.

For independent observations, the maximum likelihood estimator often follows an asymptotic normal distribution.
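The Cramér-Rao statement can be checked exactly in the Bernoulli setting, where the MLE is $$\hat{p}=t/n$$. The sketch below relies on the standard fact that the Fisher information of a single Bernoulli trial is $$1/(p(1-p))$$, so the bound for $$n$$ trials is $$p(1-p)/n$$. In this particular model the bound happens to be attained for every $$n$$, not only in the limit, which is what makes an exact check possible; this should not be read as typical of MLEs in general. The function names and the value $$p=1/3$$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def mle_variance(n, p):
    """Exact variance of p_hat = T / n, where T ~ Binomial(n, p)."""
    pmf = [comb(n, t) * p**t * (1 - p)**(n - t) for t in range(n + 1)]
    mean = sum(Fraction(t, n) * w for t, w in enumerate(pmf))
    return sum((Fraction(t, n) - mean) ** 2 * w for t, w in enumerate(pmf))

def cramer_rao_bound(n, p):
    """Lower bound on the variance of an unbiased estimator of p from n Bernoulli trials."""
    return p * (1 - p) / n

p = Fraction(1, 3)  # exact arithmetic avoids rounding error in the comparison
results = {n: (mle_variance(n, p), cramer_rao_bound(n, p)) for n in (5, 20, 80)}

for n, (variance, bound) in results.items():
    print(n, variance, bound)
```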

Bias
The bias of maximum-likelihood estimators can be substantial. Consider a case where n tickets numbered from 1 to n are placed in a box and one is selected at random (see uniform distribution). If n is unknown, then the maximum-likelihood estimator of n is the value on the drawn ticket, even though the expectation of that value is only $$(n+1)/2$$. In estimating the highest number n, we can only be certain that it is greater than or equal to the drawn ticket number.
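The ticket example is small enough to verify exactly. In the sketch below, the likelihood of a box size n given one drawn ticket is 1/n when n is at least the drawn number and zero otherwise. The search cap `max_n` and the function names are assumptions introduced only to make the maximisation finite and concrete.

```python
from fractions import Fraction

def ticket_likelihood(n, drawn):
    """P(drawn | n) for one ticket drawn uniformly from 1..n."""
    return Fraction(1, n) if 1 <= drawn <= n else Fraction(0)

def ticket_mle(drawn, max_n=1000):
    """Box size that maximises the likelihood; max_n is a hypothetical search cap."""
    return max(range(1, max_n + 1), key=lambda n: ticket_likelihood(n, drawn))

def expected_mle(n):
    """Exact expectation of the MLE when the true box size is n."""
    return sum(Fraction(ticket_mle(x), n) for x in range(1, n + 1))

true_n = 10
print(ticket_mle(7))         # the estimate equals the drawn ticket number
print(expected_mle(true_n))  # compare with (true_n + 1) / 2
```

Averaged over all possible draws, the estimator falls well short of the true box size, which is the bias described above.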

External resources

 * A paper detailing the history of maximum likelihood, written by John Aldrich
