Exponential family

In probability and statistics, an exponential family is any class of probability distributions having a certain form. This special form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural distributions to consider. The concept of exponential families is credited to E. J. G. Pitman, G. Darmois, and B. O. Koopman in 1935-36.

Definition
The following is a sequence of increasingly general definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

Scalar parameter
A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form
 * $$ f_X(x; \theta) = h(x) \exp(\eta(\theta) T(x) - A(\theta)) \,\!$$

where $$T(x)$$, $$h(x)$$, $$\eta(\theta)$$, and $$A(\theta)$$ are known functions.

The value θ is called the parameter of the family.

Note that x is often a vector of measurements, in which case T(x) is a function from the space of possible values of x to the real numbers.

If η(θ) = θ, then the exponential family is said to be in canonical form. By defining a transformed parameter η = η(θ), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(x) is multiplied by the reciprocal of that constant.

Further down the page is the example of a normal distribution with unknown mean and known variance.

Vector parameter
The single-parameter definition can be extended to a vector parameter $${\boldsymbol \theta} = (\theta_1, \theta_2, \ldots, \theta_s)^T$$. A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as
 * $$ f_X(x; {\boldsymbol \theta}) = h(x) \exp\left(\sum_{i=1}^s \eta_i({\boldsymbol \theta}) T_i(x) - A({\boldsymbol \theta}) \right) \,\!$$

As in the scalar-valued case, the exponential family is said to be in canonical form if $$\eta_i({\boldsymbol \theta}) = \theta_i$$, for all $$i$$.

Further down the page is the example of a normal distribution with unknown mean and variance.

Measure-theoretic formulation
We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable and H(x) approaches 0 as x approaches −∞. Then Lebesgue-Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.

Any member of that exponential family has a cumulative distribution function F that satisfies
 * $$dF(x|\eta) = e^{\eta^{\top} T(x) - A(\eta)}\, dH(x).$$

If F is a continuous distribution with a density, one can write dF(x) = f(x) dx.

H(x) is a Lebesgue-Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then H is a step function (with steps on the support of F).

Interpretation
In the definitions above, the functions $$T(x), \eta(\theta),$$ and $$A(\theta)$$ were introduced without explanation. However, each of these functions plays a specific role in the resulting probability distribution.


 * $$T(x)$$ is a sufficient statistic of the distribution. Thus, for exponential families, there exists a sufficient statistic whose dimension equals the number of parameters to be estimated. This important property is further discussed below.


 * $$\eta$$ is called the natural parameter. The set of values of $$\eta$$ for which the function $$f_X(x;\theta)$$ is integrable (equivalently, for which the normalization factor is finite) is called the natural parameter space. It can be shown that the natural parameter space is always convex.


 * $$A(\theta)$$ is a normalization factor without which $$f_X(x;\theta)$$ would not be a probability distribution. The function A is important in its own right: when the reference measure $$dH(x)$$ is a probability measure (alternatively: when $$h(x)$$ is a probability density), A is the cumulant generating function of the probability distribution of the sufficient statistic $$T(X)$$ when the distribution of $$X$$ is $$dH(x)$$.

Examples
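The normal, gamma, chi-square, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial, geometric, and Weibull distributions are all exponential families, although some of them qualify only when certain parameters are held fixed (for example, the number of trials for the binomial and multinomial, the number of failures for the negative binomial, and the shape parameter for the Weibull). The Cauchy, Laplace, and uniform families of distributions are not exponential families.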

Following are some detailed examples of the representation of some useful distributions as exponential families.

Normal distribution: Unknown mean, unit variance
As a first example, suppose $$x$$ is distributed normally with unknown mean $$\mu$$ and variance 1. The probability density function is then
 * $$f_X(x;\mu) = \frac{1}{\sqrt{2 \pi}} e^{-(x-\mu)^2/2}.$$

This is a scalar exponential family in canonical form, as can be seen by setting
 * $$h(x) = e^{-x^2/2}/\sqrt{2\pi}$$
 * $$T(x) = x\!\,$$
 * $$A(\mu) = \mu^2/2.\!\,$$
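As a quick numerical sanity check of this decomposition, the following Python sketch evaluates the density both directly and in the form h(x) exp(η T(x) − A(μ)) with η = μ. The function names are illustrative only and are not taken from any library.

```python
import math

def normal_pdf(x, mu):
    """Density of the normal distribution with mean mu and variance 1."""
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def exp_family_pdf(x, mu):
    """The same density assembled from the exponential-family pieces."""
    h = math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)  # h(x)
    T = x                                               # sufficient statistic T(x)
    A = mu ** 2 / 2                                     # normalization term A(mu)
    eta = mu                                            # canonical form: eta = mu
    return h * math.exp(eta * T - A)

# The two expressions should agree up to floating-point rounding.
for x, mu in [(0.0, 0.0), (1.3, -0.7), (-2.0, 2.5)]:
    assert math.isclose(normal_pdf(x, mu), exp_family_pdf(x, mu), rel_tol=1e-9)
```

The agreement follows from expanding the square: −(x − μ)²/2 = −x²/2 + μx − μ²/2.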

Normal distribution: Unknown mean and unknown variance
Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then
 * $$f_X(x;\mu,\sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-(x-\mu)^2/(2 \sigma^2)}.$$

This is an exponential family which can be written in canonical form by defining
 * $$ {\boldsymbol \theta} = \left({\mu \over \sigma^2},{1 \over \sigma^2} \right)^T $$
 * $$ h(x) = {1 \over \sqrt{2 \pi}} $$
 * $$ T(x) = \left( x, -{x^2 \over 2} \right)^T $$
 * $$ A({\boldsymbol \theta}) = { \theta_1^2 \over 2 \theta_2} - \ln( \theta_2^{1/2} ) = { \mu^2 \over 2 \sigma^2} - \ln \left( {1 \over \sigma } \right) $$
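The two-parameter representation can be checked numerically in the same spirit. The sketch below, with illustrative function names, builds the density from θ = (μ/σ², 1/σ²), T(x) = (x, −x²/2) and the stated A(θ), and compares it with the usual formula.

```python
import math

def normal_pdf_2p(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2, written directly."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def canonical_normal_pdf(x, mu, sigma2):
    """The same density in canonical exponential-family form."""
    theta1, theta2 = mu / sigma2, 1 / sigma2        # natural parameters
    h = 1 / math.sqrt(2 * math.pi)                  # h(x) is constant here
    T1, T2 = x, -(x ** 2) / 2                       # sufficient statistics
    A = theta1 ** 2 / (2 * theta2) - 0.5 * math.log(theta2)
    return h * math.exp(theta1 * T1 + theta2 * T2 - A)

# Compare the two forms at a few arbitrary points.
for x, mu, sigma2 in [(0.0, 0.0, 1.0), (1.0, 0.5, 2.0), (-3.0, 1.5, 0.25)]:
    assert math.isclose(normal_pdf_2p(x, mu, sigma2),
                        canonical_normal_pdf(x, mu, sigma2), rel_tol=1e-9)
```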

Binomial distribution
As an example of a discrete exponential family, consider the binomial distribution with a known number of trials n. The probability mass function for this distribution is
 * $$f(x)={n \choose x}p^x (1-p)^{n-x}, \quad x \in \{0, 1, 2, \ldots, n\}.$$

This can equivalently be written as
 * $$f(x)={n \choose x}\exp\left(x \log\left({p \over 1-p}\right) + n \log\left(1-p\right)\right),$$

which shows that the binomial distribution is an exponential family, whose natural parameter is
 * $$\eta = \log{p \over 1-p}.$$
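The rewritten form can be verified numerically. In the sketch below (illustrative names, not a library API), the normalization term is expressed through the natural parameter as A(η) = n log(1 + e^η), which equals −n log(1 − p) because 1 − p = 1/(1 + e^η).

```python
import math

def binomial_pmf(x, n, p):
    """Binomial probability mass function, written directly."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def binomial_exp_family_pmf(x, n, p):
    """The same mass function as h(x) * exp(eta * T(x) - A(eta)) with T(x) = x."""
    eta = math.log(p / (1 - p))            # natural parameter (log-odds)
    A = n * math.log(1 + math.exp(eta))    # equals -n * log(1 - p)
    h = math.comb(n, x)                    # h(x) is the binomial coefficient
    return h * math.exp(eta * x - A)

n, p = 10, 0.3
for x in range(n + 1):
    assert math.isclose(binomial_pmf(x, n, p),
                        binomial_exp_family_pmf(x, n, p), rel_tol=1e-9)
```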

Differential identities
As mentioned above, A is closely related to cumulant generating functions: when the natural parameter is $$\scriptstyle \eta $$, the function $$\scriptstyle K(u) = A(u + \eta) - A(\eta) $$ is the cumulant generating function for $$\scriptstyle T $$. A consequence of this is that one can fully understand the mean and covariance structure of $$\scriptstyle T = (T_{1}, T_{2}, \dots, T_{p}) $$ by differentiating $$ \scriptstyle A(\eta) $$.


 * $$ E(T_{j}) = \frac{ \partial A(\eta) }{ \partial \eta_{j} } $$

and


 * $$ \mathrm{cov}(T_{i},T_{j}) = \frac{ \partial^{2} A(\eta) }{ \partial \eta_{i} \, \partial \eta_{j} }. $$

The first two raw moments and all mixed second moments can be recovered from these two identities. This is often useful when $$\scriptstyle T $$ is a complicated function of the data whose moments are difficult to calculate by integration. As an example consider a real-valued random variable $$\scriptstyle X $$ with density


 * $$ p_{\theta}(x) = \frac{ \theta e^{-x} }{(1 + e^{-x})^{\theta + 1} } $$

indexed by shape parameter $$ \theta \in (0,\infty) $$ (this distribution is called the skew-logistic). The density can be rewritten as


 * $$ \frac{ e^{-x} } { 1 + e^{-x} } \mathrm{exp}( -\theta \mathrm{log}(1 + e^{-x}) + \mathrm{log}(\theta)) $$

Notice this is an exponential family with canonical parameter


 * $$ \eta = -\theta, $$

sufficient statistic


 * $$ T = \mathrm{log}(1 + e^{-x}), $$

and normalizing factor


 * $$ A(\eta) = -\mathrm{log}(\theta) = -\mathrm{log}(-\eta) $$

So using the first identity,


 * $$ E(\mathrm{log}(1 + e^{-X})) = E(T) = \frac{ \partial A(\eta) }{ \partial \eta } = \frac{ \partial }{ \partial \eta } [-\mathrm{log}(-\eta)] = \frac{1}{-\eta} = \frac{1}{\theta}, $$

and using the second identity


 * $$ \mathrm{var}(\mathrm{log}(1 + e^{-X})) = \frac{ \partial^{2} A(\eta) }{ \partial \eta^{2} } = \frac{ \partial }{ \partial \eta } \left[\frac{1}{-\eta}\right] = \frac{1}{(-\eta)^{2}} = \frac{1}{\theta^2}. $$

This example illustrates a case where the method is very simple, whereas computing the same moments directly by integration would be considerably more laborious.
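The two results can be cross-checked numerically. The following Python sketch differentiates A(η) = −log(−η) by finite differences and compares the values with Monte Carlo estimates of the mean and variance of T. It assumes inverse-CDF sampling based on F(x) = (1 + e^{−x})^{−θ}, which is obtained by integrating the density above. All names, the step sizes, the seed and the sample size are illustrative choices.

```python
import math
import random

def A(eta):
    """Normalizing factor of the skew-logistic family (requires eta < 0)."""
    return -math.log(-eta)

def derivative(f, x, step=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def second_derivative(f, x, step=1e-3):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step ** 2

def sample_T(theta, n, seed=0):
    """Draw n values of T = log(1 + exp(-X)) with X skew-logistic.

    X is sampled by inverting the CDF F(x) = (1 + exp(-x)) ** (-theta).
    Assumes theta is not tiny, otherwise expm1 may overflow.
    """
    rng = random.Random(seed)
    values = []
    while len(values) < n:
        u = rng.random()
        if u == 0.0:
            continue  # avoid log(0)
        w = math.expm1(-math.log(u) / theta)  # w = u ** (-1 / theta) - 1 > 0
        x = -math.log(w)                      # x = F^{-1}(u)
        values.append(math.log(1.0 + math.exp(-x)))
    return values

theta = 2.5
eta = -theta
t = sample_T(theta, 100_000)
mean_T = sum(t) / len(t)
var_T = sum((v - mean_T) ** 2 for v in t) / (len(t) - 1)

print("dA/deta   :", derivative(A, eta), " sample mean    :", mean_T)
print("d2A/deta2 :", second_derivative(A, eta), " sample variance:", var_T)
```

For θ = 2.5 the identities give 1/θ = 0.4 and 1/θ² = 0.16. The simulated values should be close to these, up to Monte Carlo error.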

Maximum entropy derivation
The exponential family arises naturally as the answer to the following question: what is the maximum entropy distribution consistent with given constraints on expected values?

The information entropy of a probability distribution dF(x) can only be computed with respect to some other probability distribution (or, more generally, a positive measure), and both measures must be mutually absolutely continuous. Accordingly, we need to pick a reference measure dH(x) with the same support as dF(x). From a frequentist standpoint this choice is largely arbitrary, whereas a Bayesian can make it part of the prior probability distribution.

The entropy of dF(x) relative to dH(x) is


 * $$S[dF|dH]=-\int {dF\over dH}\ln{dF\over dH}\,dH$$

or


 * $$S[dF|dH]=\int\ln{dH\over dF}\,dF$$

where dF/dH and dH/dF are Radon-Nikodym derivatives. Note that the ordinary definition of entropy for a discrete distribution supported on a set I, namely


 * $$S=-\sum_{i\in I} p_i\ln p_i$$

assumes (though this is seldom pointed out) that dH is chosen to be counting measure on I.

Consider now a collection of observable quantities (random variables) $$T_i$$. The probability distribution dF whose entropy with respect to dH is greatest, subject to the conditions that the expected value of $$T_i$$ be equal to $$t_i$$, is a member of the exponential family with dH as reference measure and $$(T_1, \ldots, T_n)$$ as sufficient statistic.

The derivation is a simple variational calculation using Lagrange multipliers. Normalization is imposed by letting $$T_0 = 1$$ be one of the constraints. The natural parameters of the distribution are the Lagrange multipliers, and the normalization factor is the Lagrange multiplier associated to $$T_0$$.
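A small discrete illustration, offered as a sketch rather than a proof: take the support {1, ..., 6} with counting measure as reference and a single constraint on the mean. The maximum entropy distribution then has probabilities proportional to exp(ηx), where η is found here by bisection. The target mean 4.5, the perturbation direction and all names are assumptions made for the example.

```python
import math

SUPPORT = [1, 2, 3, 4, 5, 6]

def exp_family_probs(eta):
    """Probabilities proportional to exp(eta * x) on SUPPORT."""
    weights = [math.exp(eta * x) for x in SUPPORT]
    total = sum(weights)
    return [w / total for w in weights]

def mean(probs):
    return sum(x * p for x, p in zip(SUPPORT, probs))

def entropy(probs):
    """Entropy relative to counting measure."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def solve_eta(target_mean, lo=-20.0, hi=20.0, iterations=200):
    """Bisection on eta; the mean is an increasing function of eta."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mean(exp_family_probs(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

eta = solve_eta(4.5)
p_star = exp_family_probs(eta)

# Perturb along a direction that preserves both constraints:
# the entries of v sum to 0, and sum(x * v) = 1 - 4 + 3 = 0.
v = [1, -2, 1, 0, 0, 0]
eps = 0.01
p_plus = [p + eps * d for p, d in zip(p_star, v)]
p_minus = [p - eps * d for p, d in zip(p_star, v)]

print("eta:", eta, "mean:", mean(p_star))
print("entropies:", entropy(p_star), entropy(p_plus), entropy(p_minus))
```

Both perturbed distributions satisfy the same normalization and mean constraints but have lower entropy than the exponential-form solution, as the variational argument predicts.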

Classical estimation: sufficiency
According to the Pitman-Koopman-Darmois theorem, among families of probability distributions whose domain does not vary with the parameter being estimated, only in exponential families is there a sufficient statistic whose dimension remains bounded as sample size increases. Less tersely, suppose $$X_n$$, n = 1, 2, 3, ... are independent identically distributed random variables whose distribution is known to be in some family of probability distributions. Only if that family is an exponential family is there a (possibly vector-valued) sufficient statistic $$T(X_1, \ldots, X_n)$$ whose number of scalar components does not increase as the sample size n increases.
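To make the role of a fixed-dimension sufficient statistic concrete, consider the normal family with unit variance from the examples above, where T(x) = x. For an independent sample, the likelihood depends on the parameter only through the sample size and the sum of the observations, so two samples of the same size with the same sum give identical likelihood ratios. The sketch below checks this numerically. The sample values and function names are arbitrary illustrations.

```python
import math

def log_likelihood(sample, mu):
    """Log-likelihood of independent N(mu, 1) observations."""
    return sum(-0.5 * math.log(2 * math.pi) - (x - mu) ** 2 / 2 for x in sample)

def log_likelihood_ratio(sample, mu1, mu2):
    """log L(mu1) - log L(mu2); depends on the data only via len and sum."""
    return log_likelihood(sample, mu1) - log_likelihood(sample, mu2)

sample_a = [0.0, 1.0, 2.0]
sample_b = [1.5, 1.5, 0.0]   # different data, same size and same sum
sample_c = [0.0, 1.0, 3.0]   # same size, different sum

ratio_a = log_likelihood_ratio(sample_a, 0.3, -1.2)
ratio_b = log_likelihood_ratio(sample_b, 0.3, -1.2)
ratio_c = log_likelihood_ratio(sample_c, 0.3, -1.2)

assert math.isclose(ratio_a, ratio_b, abs_tol=1e-12)
assert not math.isclose(ratio_a, ratio_c, abs_tol=1e-6)
```

In closed form the ratio is (μ1 − μ2) Σx − n(μ1² − μ2²)/2, which involves only the sample size n and the sum Σx however large the sample becomes.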

Bayesian estimation: conjugate distributions
Exponential families are also important in Bayesian statistics. In Bayesian statistics a prior distribution is multiplied by a likelihood function and then normalised to produce a posterior distribution. In the case of a likelihood which belongs to the exponential family there exists a conjugate prior, which is often also in the exponential family. A conjugate prior π for the parameter η of an exponential family is given by


 * $$\pi(\eta) \propto \exp(-\eta^{\top} \alpha - \beta\, A(\eta)),$$

where $$\alpha \in \mathbb{R}^n$$ and $$\beta>0$$ are hyperparameters (parameters controlling parameters).

A conjugate prior is one which, when combined with the likelihood and normalised, produces a posterior distribution which is of the same type as the prior. For example, if one is estimating the success probability of a binomial distribution, then if one chooses to use a beta distribution as one's prior, the posterior is another beta distribution. This makes the computation of the posterior particularly simple. Similarly, if one is estimating the parameter of a Poisson distribution, the use of a gamma prior will lead to another gamma posterior. Conjugate priors are often very flexible and can be very convenient. However, if one's belief about the likely value of the success probability of a binomial is represented by (say) a bimodal (two-humped) prior distribution, then this cannot be represented by a beta distribution.
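The beta-binomial case can be sketched in a few lines of Python. With a Beta(a, b) prior and x successes in n trials, the posterior is Beta(a + x, b + n − x). The code below checks this by confirming that prior times likelihood is a constant multiple of the claimed posterior density across a grid of probabilities. The hyperparameter values, the data and the function names are assumptions chosen for illustration.

```python
import math

def beta_pdf(p, a, b):
    """Density of the Beta(a, b) distribution at p in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def binomial_likelihood(x, n, p):
    """Probability of x successes in n trials with success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def posterior_params(a, b, x, n):
    """Conjugate update for a Beta(a, b) prior and binomial data."""
    return a + x, b + n - x

a, b = 2.0, 3.0      # prior hyperparameters (illustrative)
x, n = 7, 10         # observed successes and trials (illustrative)
a_post, b_post = posterior_params(a, b, x, n)

# Prior times likelihood divided by the claimed posterior should not depend on p;
# the common value is the marginal likelihood of the data.
grid = [0.1, 0.3, 0.5, 0.7, 0.9]
ratios = [beta_pdf(p, a, b) * binomial_likelihood(x, n, p) / beta_pdf(p, a_post, b_post)
          for p in grid]
assert all(math.isclose(r, ratios[0], rel_tol=1e-9) for r in ratios)
```

No numerical integration is needed to obtain the posterior: updating the two hyperparameters is the entire computation.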

An arbitrary likelihood need not belong to an exponential family, in which case a conjugate prior generally does not exist. The posterior will then have to be computed by numerical methods.