Covariance matrix

Overview
In statistics and probability theory, the covariance matrix is the matrix of covariances between the elements of a random vector. It is the natural generalization to higher dimensions of the variance of a scalar-valued random variable.

Definition
If entries in the column vector


 * $$X = \begin{bmatrix}X_1 \\ \vdots \\ X_n \end{bmatrix}$$

are random variables, each with finite variance, then the covariance matrix $$\Sigma$$ is the matrix whose $$(i, j)$$ entry is the covariance



$$\Sigma_{ij} = \mathrm{E}\left[ (X_i - \mu_i)(X_j - \mu_j) \right]$$

where


 * $$\mu_i = \mathrm{E}(X_i)\,$$

is the expected value of the $$i$$th entry in the vector $$X$$. In other words, we have



$$\Sigma = \begin{bmatrix} \mathrm{E}[(X_1 - \mu_1)(X_1 - \mu_1)] & \mathrm{E}[(X_1 - \mu_1)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_1 - \mu_1)(X_n - \mu_n)] \\ \\ \mathrm{E}[(X_2 - \mu_2)(X_1 - \mu_1)] & \mathrm{E}[(X_2 - \mu_2)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_2 - \mu_2)(X_n - \mu_n)] \\ \\ \vdots & \vdots & \ddots & \vdots \\ \\ \mathrm{E}[(X_n - \mu_n)(X_1 - \mu_1)] & \mathrm{E}[(X_n - \mu_n)(X_2 - \mu_2)] & \cdots & \mathrm{E}[(X_n - \mu_n)(X_n - \mu_n)] \end{bmatrix}. $$
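
For concreteness, the following minimal numpy sketch (the variable names are ours, chosen for illustration) estimates each entry $$\Sigma_{ij}$$ by averaging products of deviations over a sample and checks the result against numpy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 3))   # 1000 draws of a 3-dimensional random vector

mu = samples.mean(axis=0)              # estimates of mu_1, ..., mu_n
deviations = samples - mu              # X_i - mu_i for each draw and component

# Sigma_ij = E[(X_i - mu_i)(X_j - mu_j)], estimated by the sample average
n = samples.shape[1]
sigma = np.empty((n, n))
for i in range(n):
    for j in range(n):
        sigma[i, j] = np.mean(deviations[:, i] * deviations[:, j])

# np.cov treats rows as variables; bias=True selects the same 1/N average used above
assert np.allclose(sigma, np.cov(samples.T, bias=True))
```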

As a generalization of the variance
The definition above is equivalent to the matrix equality



$$\Sigma=\mathrm{E} \left[ \left( \textbf{X} - \mathrm{E}[\textbf{X}] \right) \left( \textbf{X} - \mathrm{E}[\textbf{X}] \right)^\top \right] $$

This form can be seen as a generalization of the scalar-valued variance to higher dimensions. Recall that for a scalar-valued random variable $$X$$,



$$\sigma^2 = \operatorname{var}(X) = \mathrm{E}[(X-\mu)^2], \, $$

where


 * $$\mu = \mathrm{E}(X).\,$$
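
The outer-product form above is also how $$\Sigma$$ is computed in practice; a short sketch (the helper name is ours) that additionally confirms the reduction to the scalar variance when $$n = 1$$:

```python
import numpy as np

def covariance_matrix(samples: np.ndarray) -> np.ndarray:
    """Estimate Sigma = E[(X - E[X])(X - E[X])^T] from rows-as-samples."""
    deviations = samples - samples.mean(axis=0)
    # Average of the outer products (X - mu)(X - mu)^T over the samples.
    return deviations.T @ deviations / len(samples)

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=(5000, 1))
# For n = 1 the covariance matrix is 1 x 1 and is just the scalar variance.
assert np.allclose(covariance_matrix(x)[0, 0], x.var())
```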

Conflicting nomenclatures and notations
Nomenclatures differ. Some statisticians, following the probabilist William Feller, call this matrix the variance of the random vector $$X$$, because it is the natural generalization to higher dimensions of the 1-dimensional variance. Others call it the covariance matrix, because it is the matrix of covariances between the scalar components of the vector $$X$$. Thus

$$\operatorname{var}(\textbf{X}) = \operatorname{cov}(\textbf{X}) = \mathrm{E} \left[ (\textbf{X} - \mathrm{E} [\textbf{X}]) (\textbf{X} - \mathrm{E} [\textbf{X}])^\top \right] $$

However, the notation for the "cross-covariance" between two vectors is standard:

$$\operatorname{cov}(\textbf{X},\textbf{Y}) = \mathrm{E} \left[ (\textbf{X} - \mathrm{E}[\textbf{X}]) (\textbf{Y} - \mathrm{E}[\textbf{Y}])^\top \right] $$

The $$\operatorname{var}$$ notation is found in William Feller's two-volume book An Introduction to Probability Theory and Its Applications, but both forms are quite standard and there is no ambiguity between them.

Properties
For $$\Sigma=\mathrm{E} \left[ \left( \textbf{X} - \mathrm{E}[\textbf{X}] \right) \left( \textbf{X} - \mathrm{E}[\textbf{X}] \right)^\top \right]$$ and $$ \mu = \mathrm{E}(\textbf{X})$$ the following basic properties apply:
 * 1) $$ \Sigma = \mathrm{E}(\mathbf{X X^\top}) - \mathbf{\mu}\mathbf{\mu^\top} $$
 * 2) $$ \mathbf{\Sigma}$$ is symmetric and positive semi-definite
 * 3) $$ \operatorname{var}(\mathbf{A X} + \mathbf{a}) = \mathbf{A}\, \operatorname{var}(\mathbf{X})\, \mathbf{A^\top} $$
 * 4) $$ \operatorname{cov}(\mathbf{X},\mathbf{Y}) = \operatorname{cov}(\mathbf{Y},\mathbf{X})^\top$$
 * 5) $$ \operatorname{cov}(\mathbf{X_1} + \mathbf{X_2},\mathbf{Y}) = \operatorname{cov}(\mathbf{X_1},\mathbf{Y}) + \operatorname{cov}(\mathbf{X_2}, \mathbf{Y})$$
 * 6) If p = q, then $$\operatorname{var}(\mathbf{X} + \mathbf{Y}) = \operatorname{var}(\mathbf{X}) + \operatorname{cov}(\mathbf{X},\mathbf{Y}) + \operatorname{cov}(\mathbf{Y}, \mathbf{X}) + \operatorname{var}(\mathbf{Y})$$
 * 7) $$\operatorname{cov}(\mathbf{AX}, \mathbf{BY}) = \mathbf{A}\, \operatorname{cov}(\mathbf{X}, \mathbf{Y}) \,\mathbf{B}^\top$$
 * 8) If $$\mathbf{X}$$ and $$\mathbf{Y}$$ are independent, then $$\operatorname{cov}(\mathbf{X}, \mathbf{Y}) = 0$$

where $$\mathbf{X}, \mathbf{X_1}$$ and $$\mathbf{X_2}$$ are random $$(p \times 1)$$ vectors, $$\mathbf{Y}$$ is a random $$(q \times 1)$$ vector, $$\mathbf{a}$$ is a constant vector, and $$\mathbf{A}$$ and $$\mathbf{B}$$ are constant matrices of conformable dimensions. Properties 3 and 4 are checked numerically in the sketch below.
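
These identities hold exactly for the empirical covariance as well (it is the covariance matrix of the empirical distribution), which makes them easy to sanity-check numerically. A small numpy sketch, with illustrative names only:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3))        # samples of a (3 x 1) random vector
Y = rng.normal(size=(2000, 2))        # samples of a (2 x 1) random vector
A = rng.normal(size=(4, 3))           # a constant matrix
a = rng.normal(size=4)                # a constant vector

def cov(u, v):
    """Empirical cross-covariance cov(U, V) = E[(U - E[U])(V - E[V])^T]."""
    du, dv = u - u.mean(axis=0), v - v.mean(axis=0)
    return du.T @ dv / len(u)

# Property 3: var(AX + a) = A var(X) A^T  (the constant shift a drops out)
assert np.allclose(cov(X @ A.T + a, X @ A.T + a), A @ cov(X, X) @ A.T)

# Property 4: cov(X, Y) is the transpose of cov(Y, X)
assert np.allclose(cov(X, Y), cov(Y, X).T)
```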

The covariance matrix, simple as it is, is a useful tool in many different areas. From it a transformation matrix can be derived that completely decorrelates the data or, from a different point of view, finds an optimal basis for representing the data compactly (see Rayleigh quotient for a formal proof and additional properties of covariance matrices). This is called principal component analysis (PCA) in statistics and the Karhunen-Loève transform (KL-transform) in image processing.
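
A minimal sketch of that decorrelation step, assuming plain numpy and an eigendecomposition of the empirical covariance (one standard way to realize PCA, not any particular library's API):

```python
import numpy as np

rng = np.random.default_rng(3)
# Correlated 2-D data: a random linear mix of independent components.
data = rng.normal(size=(5000, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])

centered = data - data.mean(axis=0)
sigma = centered.T @ centered / len(centered)

# Eigenvectors of the (symmetric) covariance matrix form an orthonormal basis;
# np.linalg.eigh returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

# Rotating into the eigenbasis decorrelates the data: its covariance becomes
# the diagonal matrix of eigenvalues (the principal-component variances).
projected = centered @ eigenvectors
assert np.allclose(projected.T @ projected / len(projected), np.diag(eigenvalues))
```

Keeping only the columns of `eigenvectors` with the largest eigenvalues yields the compact representation mentioned above.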

Which matrices are covariance matrices
From the identity


 * $$\operatorname{var}(\mathbf{a^\top}\mathbf{X}) = \mathbf{a^\top} \operatorname{var}(\mathbf{X}) \mathbf{a}\,$$

and the fact that the variance of any real-valued random variable is nonnegative, it follows immediately that only a nonnegative-definite matrix can be a covariance matrix. The converse question is whether every nonnegative-definite symmetric matrix is a covariance matrix. The answer is yes. To see this, suppose $$M$$ is a $$p \times p$$ nonnegative-definite symmetric matrix. From the finite-dimensional case of the spectral theorem, it follows that $$M$$ has a nonnegative-definite symmetric square root, which we may call $$M^{1/2}$$. Let $$\mathbf{X}$$ be any $$p \times 1$$ column-vector-valued random variable whose covariance matrix is the $$p \times p$$ identity matrix. Then


 * $$\operatorname{var}(M^{1/2}\mathbf{X}) = M^{1/2} (\operatorname{var}(\mathbf{X})) M^{1/2} = M.\,$$
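
This construction is also how one samples a vector with a prescribed covariance in practice. A minimal numpy sketch under the stated assumptions ($$M$$ symmetric nonnegative-definite; the matrix below is an arbitrary example), forming the symmetric square root from an eigendecomposition:

```python
import numpy as np

# A symmetric nonnegative-definite target matrix M.
M = np.array([[2.0, 0.6, 0.3],
              [0.6, 1.0, 0.2],
              [0.3, 0.2, 0.5]])

# Symmetric square root via the spectral theorem: M = V diag(w) V^T,
# hence M^{1/2} = V diag(sqrt(w)) V^T.
w, V = np.linalg.eigh(M)
M_sqrt = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

# X has the identity as covariance matrix, so var(M^{1/2} X) = M^{1/2} I M^{1/2} = M.
rng = np.random.default_rng(4)
X = rng.normal(size=(200_000, 3))     # samples with (approximately) identity covariance
Z = X @ M_sqrt.T                      # samples of M^{1/2} X
Zc = Z - Z.mean(axis=0)
assert np.allclose(Zc.T @ Zc / len(Zc), M, atol=0.05)   # up to sampling noise
```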

Complex random vectors
The variance of a complex scalar-valued random variable with expected value μ is conventionally defined using complex conjugation:



$$\operatorname{var}(z) = \operatorname{E} \left[ (z-\mu)(z-\mu)^{*} \right] $$

where the complex conjugate of a complex number $$z$$ is denoted $$z^{*}$$.

If $$Z$$ is a column-vector of complex-valued random variables, then we take the conjugate transpose by both transposing and conjugating, getting a square matrix:



$$\operatorname{E} \left[ (Z-\mu)(Z-\mu)^{*} \right] $$

where $$Z^{*}$$ denotes the conjugate transpose. This convention covers the scalar case as well, since the transpose of a scalar is the scalar itself.
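
A short numpy sketch of the complex case (illustrative names): the same outer-product average, but with the second factor conjugated, which yields a Hermitian matrix with real nonnegative diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(5)
# Samples of a complex random 2-vector Z (rows are draws).
Z = rng.normal(size=(10_000, 2)) + 1j * rng.normal(size=(10_000, 2))

D = Z - Z.mean(axis=0)
# E[(Z - mu)(Z - mu)^*]: conjugate the second factor, not just transpose it.
sigma = D.T @ D.conj() / len(D)

assert np.allclose(sigma, sigma.conj().T)     # Hermitian
assert np.all(sigma.diagonal().real >= 0)     # real, nonnegative variances
```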


Estimation
The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is perhaps surprisingly subtle. It involves the spectral theorem and the reason why it can be better to view a scalar as the trace of a 1 × 1 matrix than as a mere scalar. See estimation of covariance matrices.
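
The estimator that results is nonetheless simple to state: for i.i.d. draws from a multivariate normal, the maximum-likelihood estimate is the sample covariance with the $$1/n$$ divisor, whereas the unbiased variant divides by $$n - 1$$. A minimal sketch of that distinction:

```python
import numpy as np

def covariance_mle(samples: np.ndarray) -> np.ndarray:
    """Maximum-likelihood estimate of Sigma for i.i.d. multivariate normal rows.

    Uses the 1/n divisor; the unbiased estimator divides by n - 1 instead.
    """
    deviations = samples - samples.mean(axis=0)
    return deviations.T @ deviations / len(samples)

rng = np.random.default_rng(6)
samples = rng.normal(size=(100, 3))
# np.cov defaults to the unbiased 1/(n - 1) divisor, hence the rescaling below.
n = len(samples)
assert np.allclose(covariance_mle(samples), np.cov(samples.T) * (n - 1) / n)
```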