Mixture model

In mathematics, a mixture model is a model in which the independent variables are measured as fractions of a total.

Examples
Suppose researchers are trying to find the optimal mixture of ingredients for a fruit punch consisting of grape juice, mango juice, and pineapple juice. A mixture model is suitable here because the results of the taste tests will not depend on the amount of ingredients used to make the batch but rather on the fraction of each ingredient present in the punch. The components always sum to a whole, which a mixture model takes into account.

As another example, financial returns often behave differently in normal situations and during crisis times. A mixture model for return data seems reasonable.

Direct and indirect applications of mixture models
The financial example above is a direct application of the mixture model: a situation in which we assume an underlying mechanism by which each observation belongs to one of some number of different sources or categories. This underlying mechanism may or may not be observable. In this form of mixture, each of the sources is described by a component probability density function, and its mixture weight is the probability that an observation comes from that component.

In an indirect application of the mixture model we do not assume such a mechanism. The mixture model is simply used for its mathematical flexibility. For example, a mixture of two normal distributions with different means may result in a density with two modes, which is not captured by standard parametric distributions.

Probability mixture model
In statistics, a probability mixture model is a probability distribution that is a convex combination of other probability distributions.

Suppose that the discrete random variable $$X$$ is a mixture of $$n$$ component discrete random variables $$Y_i$$. Then, the probability mass function of $$X$$, $$f_{X}(x)$$, is a weighted sum of its component distributions:


 * $$f_{X}(x) = \sum_{i=1}^{n} a_i f_{Y_i}(x)$$

for some mixture proportions $$0 < a_{i}< 1$$ where $$a_{1} +\cdots+ a_{n} = 1$$.

The definition is the same for continuous random variables, except that the functions $$f$$ are probability density functions.
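As a minimal sketch of the weighted-sum definition, the following Python example (assuming NumPy and SciPy are available) evaluates a two-component Poisson mixture; the weights and rates are hypothetical choices for illustration.

```python
import numpy as np
from scipy.stats import poisson

a = np.array([0.3, 0.7])       # mixture proportions a_i, sum to 1
rates = np.array([2.0, 7.0])   # illustrative parameters of the components

def mixture_pmf(x):
    """f_X(x) = sum_i a_i f_{Y_i}(x) for two Poisson components."""
    return sum(a_i * poisson.pmf(x, mu) for a_i, mu in zip(a, rates))

x = np.arange(30)
print(mixture_pmf(x).sum())    # ~1, as required of a probability mass function
```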

Parametric mixture model
In the parametric mixture model, the component distributions are from a parametric family, with unknown parameters $$\theta_i$$:


 * $$f_{X}(x) = \sum_{i=1}^{n} a_i f_Y(x ; \theta_i)$$

Continuous mixture
A continuous mixture is defined similarly:


 * $$f_{X}(x) = \int_\Theta h(\theta) f_Y(x ; \theta) \, d\theta$$

where
 * $$0 \le h(\theta) \quad \forall \theta \in \Theta$$

and
 * $$\int_\Theta h(\theta) \, d\theta = 1$$
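As a numerical check of this definition (a sketch assuming SciPy; the densities are chosen purely for illustration), mixing a $$N(\theta, 1)$$ component density over a standard normal mixing density $$h$$ yields the $$N(0, 2)$$ distribution, which quadrature confirms:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def mixture_pdf(x):
    """f_X(x) = integral over theta of h(theta) f_Y(x; theta) d(theta),
    with h the standard normal density and f_Y(.; theta) = N(theta, 1)."""
    integrand = lambda theta: norm.pdf(theta) * norm.pdf(x, loc=theta)
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

x = 1.3
print(mixture_pdf(x), norm.pdf(x, scale=np.sqrt(2)))  # agree to quadrature error
```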

Identifiability
Identifiability refers to the existence of a unique characterization for any one of the models in the class being considered. If a model is not identifiable, estimation procedures may not be well defined and asymptotic theory may not hold.

Example
Let $$J$$ be the class of all binomial distributions with $$ n=2$$. Then a mixture of two members of $$J$$ would have


 * $$p_0=\pi(1-\theta_1)^2+(1-\pi)(1-\theta_2)^2$$
 * $$p_1=2\pi\theta_1(1-\theta_1)+2(1-\pi)\theta_2(1-\theta_2)$$

and $$p_2=1-p_0-p_1$$. Given $$p_0$$ and $$p_1$$, it is not possible to determine the above mixture model uniquely: there are three parameters ($$\pi,\theta_1,\theta_2$$) to be determined, but only two independent equations.
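The non-uniqueness can be verified numerically. The following sketch (assuming SciPy) fixes a reference mixture, then solves for a genuinely different parameter triple, with a different weight $$\pi$$, that reproduces the same $$(p_0, p_1)$$; all concrete values are illustrative.

```python
from scipy.optimize import fsolve

def mixture_probs(pi, t1, t2):
    """(p_0, p_1) for a two-component mixture of Binomial(2, theta)."""
    p0 = pi * (1 - t1) ** 2 + (1 - pi) * (1 - t2) ** 2
    p1 = 2 * pi * t1 * (1 - t1) + 2 * (1 - pi) * t2 * (1 - t2)
    return p0, p1

# Reference mixture: pi = 0.5, theta_1 = 0.2, theta_2 = 0.8.
target = mixture_probs(0.5, 0.2, 0.8)                 # (0.34, 0.32)

# Fix a different weight pi = 0.3 and solve for component parameters
# that reproduce the same (p_0, p_1) -- not a mere relabeling.
equations = lambda t: [a - b for a, b in
                       zip(mixture_probs(0.3, t[0], t[1]), target)]
t1, t2 = fsolve(equations, x0=[0.1, 0.7])
print((t1, t2), mixture_probs(0.3, t1, t2), target)   # same (p_0, p_1)
```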

Definition
Consider a mixture of parametric distributions of the same class. Let


 * $$J=\{f(\cdot ; \theta):\theta\in\Omega\}$$

be the class of all component distributions. Then the convex hull $$K$$ of $$J$$ defines the class of all finite mixtures of distributions in $$J$$:


 * $$K=\{p(\cdot):p(\cdot)=\sum_{i=1}^n a_i f_i(\cdot ; \theta_i), a_i>0, \sum_{i=1}^n a_i=1, f_i(\cdot ; \theta_i)\in J\ \forall i,n\}$$

$$K$$ is said to be identifiable if all its members are unique; that is, given two members $$p$$ and $$p'$$ in $$K$$ that are mixtures of $$k$$ and $$k'$$ distributions in $$J$$ respectively, we have $$p=p'$$ if and only if $$k=k'$$ and the summations can be reordered so that $$a_i=a_i'$$ and $$f_i=f_i'$$ for all $$i$$.

Common approaches for estimation in mixture models
Parametric mixture models are often used when we know the component family $$f_Y$$ and can sample from $$X$$, but we would like to determine the $$a_{i}$$ and $$\theta_i$$ values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.

It is common to think of probability mixture modeling as a missing-data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, the connection to the data points being represented as their membership in the individual model distributions.

Expectation maximization
The expectation-maximization (EM) algorithm can be used to compute the parameters of a parametric mixture model distribution (the $$a_{i}$$'s and $$\theta_{i}$$'s). It is an iterative algorithm with two steps: an expectation step and a maximization step. Practical examples of EM and mixture modeling are included in the SOCR demonstrations.

The expectation step
With initial guesses for the parameters of our mixture model, we compute "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point. That is, for each data point $$x_j$$ and distribution $$Y_i$$, we compute a membership value $$y_{i,j}$$:


 * $$ y_{i,j} = \frac{a_i f_Y(x_j;\theta_i)}{f_{X}(x_j)}$$

The maximization step
With our expectation values in hand for group membership, we can recompute plug-in estimates of our distribution parameters.

The mixing coefficients $$a_i$$ are the means of the membership values over the $$N$$ data points.


 * $$ a_i = \frac{1}{N}\sum_{j=1}^N y_{i,j}$$

The component model parameters $$\theta_{i}$$ are likewise recomputed, using the data points $$x_j$$ weighted by the membership values. For example, if the parameter $$\theta$$ is a mean $$\mu$$:


 * $$ \mu_{i} = \frac{\sum_{j} y_{i,j}x_{j}}{\sum_{j} y_{i,j}}$$

With new estimates for $$a_i$$ and the $$\theta_i$$'s, we proceed back to the expectation step to recompute new membership values. The procedure is repeated until model parameters converge.
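The following is a compact sketch of the whole procedure for a mixture of two univariate Gaussians, assuming NumPy and SciPy; the synthetic data, the starting values, the fixed iteration count, and the additional update of the standard deviations (which follows the same weighted plug-in pattern) are illustrative choices rather than part of the algorithm's definition.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

a = np.array([0.5, 0.5])      # initial mixing coefficients a_i
mu = np.array([-1.0, 1.0])    # initial component means
sigma = np.array([1.0, 1.0])  # initial component standard deviations

for _ in range(200):
    # E step: memberships y_{i,j} = a_i f_Y(x_j; theta_i) / f_X(x_j)
    dens = a[:, None] * norm.pdf(x[None, :], mu[:, None], sigma[:, None])
    y = dens / dens.sum(axis=0)

    # M step: plug-in estimates from the membership-weighted data
    a = y.mean(axis=1)                            # a_i = (1/N) sum_j y_{i,j}
    mu = (y * x).sum(axis=1) / y.sum(axis=1)      # weighted means
    sigma = np.sqrt((y * (x - mu[:, None]) ** 2).sum(axis=1) / y.sum(axis=1))

print(a, mu, sigma)   # should approach (0.3, 0.7), (-2, 3), (1, 1)
```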

Markov chain Monte Carlo
As an alternative to the EM algorithm, we can use posterior sampling, as indicated by Bayes' theorem, to deduce the parameters of our mixture model. Once again we regard this as an incomplete-data problem in which the membership of the data points is the missing data. We use a method called Gibbs sampling, which is once again a two-step iterative procedure.

We use the example of a mixture of two Gaussian distributions to demonstrate how the method works. As before, we start with initial guesses for the parameters of the mixture model. Instead of computing partial memberships for each constituent distribution, we draw a membership value for each data point from a Bernoulli distribution; that is, each point is assigned to either the first or the second Gaussian. The Bernoulli parameter is computed for each data point from its relative likelihood under the two constituent distributions, given the current parameter values (the same quantity as the partial membership in the expectation step). The draws thus generate a membership assignment for each data point. We can then use plug-in estimates, as in the M step of EM, to generate a new set of mixture model parameters, and return to the Bernoulli draw step.
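A bare-bones sketch of this scheme for two Gaussians with known unit variances is given below (assuming NumPy and SciPy). Following the description above, it uses plug-in estimates after each membership draw; the data, the iteration count, and the absence of a safeguard against empty components are illustrative simplifications.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

pi, mu = 0.5, np.array([-1.0, 1.0])           # initial guesses
for _ in range(500):
    # Draw memberships: Bernoulli probability of the second component,
    # from the relative likelihoods under the current parameters.
    w1 = pi * norm.pdf(x, mu[1])
    w0 = (1 - pi) * norm.pdf(x, mu[0])
    z = rng.random(x.size) < w1 / (w0 + w1)   # True -> second Gaussian

    # Plug-in updates, as in the M step of EM, given the drawn memberships.
    # (No safeguard here against a component becoming empty.)
    pi = z.mean()
    mu = np.array([x[~z].mean(), x[z].mean()])

print(pi, mu)   # values fluctuate near (0.7, [-2, 3])
```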

Spectral method
Some problems in mixture model estimation can be solved using spectral techniques. These are particularly useful when the data points $$x_i$$ lie in a high-dimensional Euclidean space and the hidden distributions are known to be log-concave (such as the Gaussian distribution or the exponential distribution).

Spectral methods of learning mixture models are based on the singular value decomposition of a matrix containing the data points. The idea is to consider the top $$k$$ singular vectors, where $$k$$ is the number of distributions to be learned. Projecting each data point onto the linear subspace spanned by those vectors groups points originating from the same distribution very close together, while points from different distributions stay far apart.
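A bare-bones sketch of the projection step, assuming NumPy; the synthetic, well-separated Gaussian data and the choice $$k=2$$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
k = 2                                       # number of hidden distributions
# 200 points from each of two well-separated Gaussians in 50 dimensions
centers = rng.normal(0, 5, size=(k, 50))
data = np.vstack([c + rng.normal(0, 1, size=(200, 50)) for c in centers])

# Top-k right singular vectors of the data matrix
_, _, vt = np.linalg.svd(data, full_matrices=False)
projected = data @ vt[:k].T                 # N x k coordinates

# Points from the same component now lie close together in R^k,
# so a simple clustering method can separate the two groups.
print(projected[:3])
print(projected[-3:])
```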

One distinctive feature of the spectral method is that it allows us to prove that, if the distributions satisfy a certain separation condition (e.g., they are not too close together), then the estimated mixture will be very close to the true one with high probability.

Other methods
Other methods that guarantee accurate estimation have emerged in the last few years. Some of them can even provably learn mixtures of heavy-tailed distributions, including those with infinite variance. In this setting, EM-based methods would not work, since the expectation step would diverge due to the presence of outliers.

A simulation
To simulate a sample of size $$N$$ from a mixture of distributions $$F_i$$, $$i=1,\ldots,I$$, with mixture probabilities $$p_i$$ (where $$\sum_i p_i=1$$):

 1. Generate $$N$$ random numbers $$u_1,\ldots,u_N$$ from a uniform distribution on $$(0,1)$$.
 2. Generate $$N$$ random numbers from each of the $$I$$ distributions $$F_i$$.
 3. For each $$j$$, pick the random number drawn from the $$i$$th distribution if $$u_j$$ lies between the cumulative probabilities $$P_{i-1}$$ and $$P_i$$, where $$P_i=p_1+\cdots+p_i$$ and $$P_0=0$$.
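A direct transcription of these steps in Python (assuming NumPy; the three component distributions are illustrative). Note that step 2 generates more draws than are kept, exactly as the procedure states:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10_000
p = np.array([0.2, 0.5, 0.3])                # mixture probabilities, sum to 1

u = rng.random(N)                            # step 1: uniform(0, 1) draws
draws = np.stack([rng.normal(0, 1, N),       # step 2: N draws from each F_i
                  rng.normal(5, 2, N),
                  rng.exponential(1, N)])

# Step 3: take component i when u_j falls between the cumulative
# probabilities P_{i-1} and P_i.
component = np.searchsorted(np.cumsum(p), u, side='right')
sample = draws[component, np.arange(N)]

print(np.bincount(component) / N)            # close to p
```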

Books on mixture models

 * Titterington, D., Smith, A. and Makov, U. (1985). Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons.
 * McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.
 * Marin, J.M., Mengersen, K. and Robert, C.P. "Bayesian modelling and inference on mixtures of distributions". Handbook of Statistics 25, D. Dey and C.R. Rao (eds). Elsevier-Sciences (to appear).
 * Lindsay, B.G. (1995). Mixture Models: Theory, Geometry, and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, Vol. 5. Institute of Mathematical Statistics, Hayward.