Boltzmann machine

A Boltzmann machine is the name given to a type of simulated annealing stochastic recurrent neural network by Geoffrey Hinton and Terry Sejnowski. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatoric problems. However, due to a number of issues discussed below, Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference. They are still theoretically intriguing, however, due to the locality and Hebbian nature of their training algorithm, as well as their parallelism and the resemblance of their dynamics to simple physical processes. If the connectivity is constrained, the learning can be made efficient enough to be useful for practical problems.

Structure
A Boltzmann machine, like a Hopfield Network, is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy, $$E$$, in a Boltzmann machine is identical in form to that of a Hopfield network:


 * $$E = -\sum_{i<j} w_{ij} \, s_i \, s_j + \sum_i \theta_i \, s_i$$

Where:
 * $$w_{ij}$$ is the connection strength between unit $$j$$ and unit $$i$$.
 * $$s_i$$ is the state, $$s_i \in \{-1,1\}$$, of unit $$i$$.
 * $$\theta_i$$ is the threshold of unit $$i$$.

The connections in a Boltzmann machine have two restrictions:
 * $$w_{ii}=0 \qquad \forall i$$. (No unit has a connection with itself.)
 * $$w_{ij}=w_{ji}\qquad \forall i,j$$. (All connections are symmetric.)

Thus, the difference in the global energy that results from a single unit $$i$$ being 0 versus 1, written $$\Delta E_i$$, is given by:


 * $$\Delta E_i = \sum_j w_{ij} \, s_j - \theta_i$$

A Boltzmann machine is made up of stochastic units. The probability, $$p_i$$ of the $$i$$-th unit being on is given by:


 * $$p_i = \cfrac{1}{1+e^{\left[\cfrac{-\Delta E_i}{T}\right]}}$$

where the scalar $$T$$ is referred to as the temperature of the system.

The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy, according to a Boltzmann Distribution. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we are guaranteed to converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.

If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies. This is done by the following training procedure.

Training
The units in the Boltzmann Machine are divided into "visible" units, V, and "hidden" units, H. The visible units are those which receive information from the "environment", i.e. our training set is a bunch of binary vectors over the set V. The distribution over the training set is denoted $$P^{+}(V)$$.

On the Boltzmann Machine side, as recalled, the distribution over the global states is converging as we reach a thermal equilibrium. We denote the converged distribution, after we marginalize it over the visible units $$V$$, as $$P^{-}(V)$$.

Our goal is to approximate the "real" distribution $$P^{+}(V)$$ using the $$P^{-}(V)$$ which will be produced (eventually) by the machine. To measure how similar the two distributions are we use the Kullback-Leibler distance, $$G$$:


 * $$G = \sum_{v}{P^{+}(v)\ln{\frac{P^{+}(v)}{P^{-}(v)}}}$$

Where the sum is over all the possible states of $$V$$. $$G$$ is a function of the weights, since they determine the energy of a state, and the energy determines $$P^{-}(v)$$, as promised by the Boltzmann distribution. Hence, we can use a gradient descent algorithm over $$G$$, so a given weight, $$w_{ij}$$ is changed by subtracting the partial derivative of $$G$$ with respect to the weight.

There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to $$P^{+}$$). The other is the "negative" phase where the network is allowed to run freely, i.e. no units have their state determined by external data. Surprisingly enough, the gradient with respect to a given weight, $$w_{ij}$$, is given by the very simple equation:


 * $$\frac{\partial{G}}{\partial{w_{ij}}} = -\frac{1}{T}[p_{ij}^{+}-p_{ij}^{-}]$$

Where:
 * $$p_{ij}^{+}$$ is the probability of units i and j both being on when the machine is at equilibrium on the positive phase.


 * $$p_{ij}^{-}$$ is the probability of units i and j both being on when the machine is at equilibrium on the negative phase.

This result follows from the fact that at the thermal equilibrium the probability $$P^{-}(s)$$ of any global state $$s$$ when the network is free-running is given by the Boltzmann distribution (hence the name "Boltzmann machine").

Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.

The training of a Boltzmann machines can also be viewed as an application of the EM algorithm, which is heavily used in machine learning. We are given an incomplete dataset (we only know the values of the visible units). At the positive phase, we estimate the completed data based on the current parameters of the system (the weights). Later, the weights are updated to maximize the probability of the network producing the completed data.

Problems
If it worked, the Boltzmann machine would be a sort of universal computational medium, able to solve nearly any learning problem. For instance, if trained on photographs, the machine would model the distribution of photographs, and could use that model to, for example, complete a partial photograph.

Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that the learning seems to stop working correctly when the machine is scaled up to anything larger than a trivial machine. This is due to a number of effects, the most important of which are:


 * the time the machine must be run for in order to collect equilibrium statistics grows exponentially with the machine's size, and with the magnitude of the connection strengths
 * connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap. The net effect is that noise causes the connection strengths to random walk until the activities saturate.

History
The Boltzmann machine is a Monte Carlo version of the Hopfield network.

The idea of using annealed Ising models for inference is often thought to have originated from Geoff Hinton. However some claim that an earlier version of the same idea is present in Douglas Hofstadter's Copycat project, described in the following two papers:


 * Hofstadter, Douglas R., The Copycat Project: An Experiment in Nondeterminism and Creative Analogies. MIT Artificial Intelligence Laboratory Memo No. 755, January 1984.


 * Hofstadter, Douglas R., A Non-Deterministic Approach to Analogy, Involving the Ising Model of Ferromagnetism. In E. Caianiello, ed. The Physics of Cognitive Processes. Teaneck, NJ: World Scientific, 1987.

Others, without meaning to disparage Douglas Hofstadter's contributions to the field or any potential claims of independent discovery, contend that the following earlier papers describe the idea:


 * Geoffrey E. Hinton and Terrence J. Sejnowski, Analyzing Cooperative Computation. In Proceedings of the 5th Annual Congress of the Cognitive Science Society, Rochester, NY, May 1983.


 * Geoffrey E. Hinton and Terrence J. Sejnowski, Optimal Perceptual Inference. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR),                                          pages 448-453, IEEE Computer Society, Washington DC, June 1983.

Restricted Boltzmann Machine
Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "Restricted Boltzmann Machine" or "RBM" which does not allow connections between hidden units. After learning one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBM's makes it possible to learn many layers of hidden units efficiently and as each new layer is added the overall generative model gets better. Examples of multilayer RBM models of hand-written digits and faces can be viewed at http://www.cs.utoronto.ca/~hinton/