Variational Bayesian methods

Variational Bayesian methods, also called ensemble learning, are a family of techniques for approximating intractable integrals arising in Bayesian statistics and machine learning. They can be used to lower-bound the marginal likelihood (i.e. the "evidence") of several models with a view to performing model selection, and often provide an analytical approximation to the parameter posterior, which is useful for prediction. They are an alternative to Monte Carlo sampling methods for making use of a posterior distribution that is difficult to sample from directly.

Mathematical derivation
In variational inference, the posterior distribution over a set of latent variables $$X = \{X_1 \dots X_n\}$$ given some data $$D$$ is approximated by a variational distribution


 * $$P(X|D) \approx Q(X).$$

The variational distribution $$Q(X)$$ is restricted to belong to a family of distributions of simpler form than $$P(X|D)$$. This family is selected with the intention that $$Q$$ can be made very similar to the true posterior. The difference between $$Q$$ and this true posterior is measured in terms of a dissimilarity function $$d(Q; P)$$ and hence inference is performed by selecting the distribution $$Q$$ that minimises $$d$$. One choice of dissimilarity function where this minimisation is tractable is the Kullback-Leibler divergence (KL divergence), defined as


 * $$KL(Q || P) = \sum_X Q(X) \log \frac{Q(X)}{P(X|D)}.$$
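As a minimal numerical sketch of this definition, the snippet below computes the KL divergence for a hypothetical discrete latent variable with four states; the particular distributions `p_posterior` and `q_variational` are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical example: a "true" posterior P(X|D) over four latent states
# and a simpler (here uniform) variational approximation Q(X).
p_posterior = np.array([0.1, 0.4, 0.4, 0.1])
q_variational = np.array([0.25, 0.25, 0.25, 0.25])

def kl_divergence(q, p):
    """KL(Q || P) = sum_X Q(X) log(Q(X) / P(X))."""
    return float(np.sum(q * np.log(q / p)))

# KL is non-negative, and zero exactly when Q matches P.
print(kl_divergence(q_variational, p_posterior))  # some positive value
print(kl_divergence(p_posterior, p_posterior))    # 0.0
```

Note the asymmetry: `kl_divergence(q, p)` and `kl_divergence(p, q)` generally differ, which is why the direction $$KL(Q||P)$$ (rather than $$KL(P||Q)$$) matters for the variational objective.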

We can write the log evidence as




 * $$\log P(D) = KL(Q||P) - \sum_X Q(X) \log \frac{Q(X)}{P(X,D)}$$
 * $$= KL(Q||P) + \mathcal{L}(Q),$$

where $$\mathcal{L}(Q) = \sum_X Q(X) \log \frac{P(X,D)}{Q(X)}$$ is known as the evidence lower bound (ELBO).

As the log evidence is fixed with respect to $$Q$$, maximising the final term $$\mathcal{L}(Q)$$ will minimise the KL divergence between $$Q$$ and $$P$$. By appropriate choice of $$Q$$, we can make $$\mathcal{L}(Q)$$ tractable to compute and to maximise. Hence we have both a lower bound on the evidence $$\mathcal{L}(Q)$$ and an analytical approximation to the posterior $$Q$$.
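The decomposition above can be verified numerically. The sketch below assumes a toy discrete model with three latent states and a fixed observed dataset $$D$$, represented by an assumed joint table $$P(X,D)$$; it checks that $$\log P(D) = KL(Q||P) + \mathcal{L}(Q)$$ holds for an arbitrary $$Q$$, so that the ELBO never exceeds the log evidence.

```python
import numpy as np

# Assumed toy model: discrete latent X with 3 states; the joint P(X, D)
# for one fixed dataset D is just a table of (unnormalised over X) masses.
joint = np.array([0.10, 0.25, 0.15])     # P(X, D) for each value of X
evidence = joint.sum()                   # P(D) = sum_X P(X, D)
posterior = joint / evidence             # P(X | D) by Bayes' rule

q = np.array([0.2, 0.5, 0.3])            # some variational distribution Q(X)

kl = float(np.sum(q * np.log(q / posterior)))  # KL(Q || P)
elbo = float(np.sum(q * np.log(joint / q)))    # L(Q), the evidence lower bound

# log P(D) = KL(Q||P) + L(Q); since KL >= 0, the ELBO lower-bounds log P(D).
print(np.log(evidence), kl + elbo)
```

Because $$\log P(D)$$ is constant in $$Q$$, raising the ELBO by changing `q` necessarily shrinks the KL term by the same amount, which is exactly why maximising $$\mathcal{L}(Q)$$ yields the best approximation within the chosen family.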