Higher-Order Boltzmann Machines

The Boltzmann machine is a nonlinear network of stochastic binary processing units that interact pairwise through symmetric connection strengths. In a third-order Boltzmann machine, triples of units interact through symmetric conjunctive interactions. The Boltzmann learning algorithm is generalized to higher-order interactions. The rate of learning for internal representations in a higher-order Boltzmann machine should be much faster than for a second-order Boltzmann machine based on pairwise interactions.

INTRODUCTION

Thousands of hours of practice are required by humans to become experts in domains such as chess, mathematics and physics [1]. Learning in these domains requires the mastery of a large number of highly interrelated ideas, and a deep understanding requires generalization as well as memorization. There are two traditions in the literature on learning in neural network models. One class of models is based on the problem of content-addressable memory and emphasizes a fast, one-shot form of learning. The second class of models uses slow, incremental learning, which requires many repetitions of examples. It is difficult in humans to study fast and slow learning in isolation. In some amnesics, however, the long-term retention of facts is severely impaired while the slow acquisition of skills, including cognitive skills, is spared [2]. Thus, it is possible that separate memory mechanisms are used to implement fast learning and slow learning.

Long practice is required to become an expert, but expert performance is swift and difficult to analyze; with more practice there is faster performance [1]. Why is slow learning so slow? One possibility is that the expert develops internal representations that allow fast parallel searches for solutions to problems in the task domain, in contrast to a novice who must apply knowledge piecemeal. An internal representation is a mental model of the task domain; that is, internal degrees of freedom between the sensory inputs and motor outputs that efficiently encode the variables relevant to the solution of the problem. This approach can be made more precise by specifying neural network models and showing how they incorporate internal representations.

LEARNING IN NETWORK MODELS

Network models of fast learning include linear correlation-matrix models and the more recent nonlinear autoassociative models. These models use the Hebb learning rule to store information that can be retrieved by the completion of partially specified input patterns. New patterns are stored by imposing the pattern on the network and altering the connection strengths between the pairs of units that are above threshold. The information that is stored therefore concerns the correlations, or second-order relationships, between the components of the pattern. The internal model is built from correlations.

Network models of slow learning include the perceptron [11] and adaline [12]. These networks can classify input patterns given only examples of inputs and desired outputs. The connection strengths are changed incrementally during training, and the network gradually converges to a set of weights that solves the problem, if such a set of weights exists. Unfortunately, there are many difficult problems that cannot be solved with these networks, such as the prediction of parity [13].
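To make the contrast between the two traditions concrete, the following sketch (not part of the original paper) implements a one-shot Hebbian outer-product store for a correlation-matrix memory alongside an incremental perceptron update; the +/-1 pattern coding, function names, learning rate and epoch count are illustrative assumptions rather than details taken from the models cited above.

import numpy as np

# Fast, one-shot learning: Hebbian outer-product storage for a
# correlation-matrix (auto-associative) memory over +/-1 patterns.
def hebb_store(patterns):
    """Build a weight matrix from the pairwise correlations of the stored patterns."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)      # strengthen connections between co-active pairs
    np.fill_diagonal(W, 0.0)     # no self-connections
    return W / len(patterns)

def hebb_recall(W, probe, steps=10):
    """Complete a partially specified +/-1 pattern by repeated thresholding."""
    s = probe.copy()
    for _ in range(steps):
        s = np.sign(W @ s)
        s[s == 0] = 1
    return s

# Slow, incremental learning: the perceptron rule, which adjusts a single
# layer of weights from examples of inputs and desired +/-1 outputs.
def perceptron_train(X, y, epochs=100, lr=0.1):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if x @ w + b > 0 else -1
            if pred != target:   # change weights only when an example is misclassified
                w += lr * target * x
                b += lr * target
    return w, b

Neither sketch can capture a higher-order predicate such as parity, which is the limitation taken up next.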
The perceptron and adaline are limited because they have only one layer of modifiable connection strengths and can only implement linear discriminant functions. Higher-order problems like parity also cannot be solved by storing the desired patterns with the class of content-addressable algorithms based on the Hebb learning rule. These models are limited because the metric of similarity is based on Hamming distance and only correlations can be used to access patterns.

The first network model to demonstrably learn to solve higher-order problems was the Boltzmann machine, which overcame the limitations of previous network models by introducing hidden units [14, 15, 16]. Hidden units are added to the network to mediate between the input and output units; they provide the extra internal degrees of freedom needed to form internal representations. The Boltzmann learning algorithm incrementally modifies internal connections in the network to build higher-order pattern detectors. The hidden units can be recruited to form internal representations for any problem; however, the learning may require an extremely large number of training examples and can be excessively slow. One way to speed up the learning is to use hidden units that have higher-order interactions with other units.

THIRD-ORDER BOLTZMANN MACHINES

Consider a Boltzmann machine with a cubic global energy function

    E = -\sum_{i<j<k} w_{ijk} \, s_i s_j s_k

where $s_i$ is the state of the $i$-th binary unit and $w_{ijk}$ is a weight between triples of units. This type of interaction generalizes the pairwise interactions in Hopfield networks [10] and Boltzmann machines, which contribute a quadratic term to the energy. Fig. 1 shows an interpretation of the cubic term as conjunctive synapses.

Fig. 1. Third-order interactions between three units. In the diagram the lines between units represent reciprocal interactions that are activated only when the third unit is in the on state. The third unit acts presynaptically to conjunctively control the pairwise interactions.

Each unit in the network updates its binary state asynchronously with probability

    p_i = \frac{1}{1 + e^{-\Delta E_i / T}}

where $T$ is a parameter analogous to the temperature and the total input to the $i$-th unit is given by

    \Delta E_i = \sum_{j<k} w_{ijk} \, s_j s_k

If $w_{ijk}$ is symmetric on all pairs of indices then the energy of the network is nonincreasing. It can be shown that in equilibrium the probabilities of global states $P_\alpha$ follow a Boltzmann distribution

    P_\alpha \propto e^{-E_\alpha / T}

There are two forms of the Boltzmann learning algorithm, one for networks with inputs and outputs treated identically, and a second for networks where the input units are always clamped [15]. The former learning algorithm will be generalized for third-order interactions. The learning metric on weight space remains the same:

    G = \sum_\alpha P_\alpha \ln \frac{P_\alpha}{P'_\alpha}

where $P_\alpha$ is the probability of a global state with both the inputs and outputs clamped, and $P'_\alpha$ is the probability of a global state when the network is allowed to run freely. It can be shown that the gradient of G is given by

    \frac{\partial G}{\partial w_{ijk}} = -\frac{1}{T}\left( p_{ijk} - p'_{ijk} \right)

where $p_{ijk}$ is the ensemble-average probability of three units all being in the on state when the input and output units are clamped, and $p'_{ijk}$ is the corresponding probability when the network is running freely. To minimize G, it is sufficient to measure the time-averaged triple co-occurrence probabilities when the network is in equilibrium under the two conditions and to change each weight according to

    \Delta w_{ijk} = c \left( p_{ijk} - p'_{ijk} \right)

where $c$ scales the size of each weight change.
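As an illustration only (this code does not appear in the paper), the sketch below implements the pieces just described for small networks: the stochastic unit update, the estimation of triple co-occurrence probabilities from sampled global states, and the weight change $\Delta w_{ijk} = c(p_{ijk} - p'_{ijk})$. The 0/1 state coding, the dense weight tensor w[i, j, k] assumed symmetric under index permutations, and all function names are assumptions of the sketch.

import numpy as np

rng = np.random.default_rng(0)

def total_input(w, s, i):
    """Energy gap for unit i: sum over pairs j < k (excluding i) of w[i, j, k] * s[j] * s[k]."""
    n = len(s)
    gap = 0.0
    for j in range(n):
        if j == i:
            continue
        for k in range(j + 1, n):
            if k == i:
                continue
            gap += w[i, j, k] * s[j] * s[k]
    return gap

def update_unit(w, s, i, T):
    """Asynchronous stochastic update: unit i turns on with probability 1 / (1 + exp(-dE_i / T))."""
    p_on = 1.0 / (1.0 + np.exp(-total_input(w, s, i) / T))
    s[i] = 1 if rng.random() < p_on else 0

def triple_cooccurrence(samples):
    """Estimate p_ijk, the fraction of sampled global states (a (num_samples, n)
    array of 0/1 values) in which units i, j and k are all on."""
    S = np.asarray(samples, dtype=float)
    return np.einsum('ti,tj,tk->ijk', S, S, S) / len(S)

def weight_update(p_clamped, p_free, c=0.01):
    """Third-order Boltzmann learning rule: dw_ijk = c * (p_ijk - p'_ijk)."""
    return c * (p_clamped - p_free)

In use, the co-occurrence statistics would be estimated once with the input and output units clamped to a training example and once with the network running freely, in both cases after the network has settled to equilibrium; the two estimates are then passed to weight_update.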
HIGHER-ORDER BOLTZMANN MACHINES

Define the energy of a k-th order Boltzmann machine as

    E = -\sum_{\gamma_1 < \gamma_2 < \cdots < \gamma_k} w_{\gamma_1 \gamma_2 \cdots \gamma_k} \, s_{\gamma_1} s_{\gamma_2} \cdots s_{\gamma_k}

where $w_{\gamma_1 \gamma_2 \cdots \gamma_k}$ is a k-dimensional weight matrix symmetric on all pairs of indices. G can be minimized by gradient descent:

    \Delta w_{\gamma_1 \gamma_2 \cdots \gamma_k} = c \left( p_{\gamma_1 \gamma_2 \cdots \gamma_k} - p'_{\gamma_1 \gamma_2 \cdots \gamma_k} \right)

where $p_{\gamma_1 \gamma_2 \cdots \gamma_k}$ is the probability of the k-tuple co-occurrence of the states $(s_{\gamma_1}, s_{\gamma_2}, \ldots, s_{\gamma_k})$ when the inputs and outputs are clamped, and $p'_{\gamma_1 \gamma_2 \cdots \gamma_k}$ is the corresponding probability when the network is running freely. In general, the energy for a Boltzmann machine is the sum over all orders of interaction, and the learning algorithm is a linear combination of terms from each order. This is a Markov random field with polynomial interactions [17].
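The same recipe extends to order k. The brief sketch below (again illustrative, with hypothetical function names and a dictionary of weights keyed by k-tuples of unit indices) estimates the co-occurrence probability of every k-tuple of units and applies the generalized weight change.

import itertools
import numpy as np

def ktuple_cooccurrence(samples, k):
    """Estimate, for every k-tuple of units, the probability that all k
    units are simultaneously on in the sampled 0/1 global states."""
    S = np.asarray(samples, dtype=float)
    n = S.shape[1]
    return {idx: S[:, idx].prod(axis=1).mean()
            for idx in itertools.combinations(range(n), k)}

def kth_order_update(p_clamped, p_free, c=0.01):
    """Generalized Boltzmann rule: dw = c * (p_clamped - p_free) for each k-tuple."""
    return {idx: c * (p_clamped[idx] - p_free[idx]) for idx in p_clamped}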

[1] G. G. Stokes, "J.", The New Yale Book of Quotations, 1890.
[2] A. A. Mullin et al., Principles of Neurodynamics, 1962.
[3] T. Shallice et al., "Learning and Memory," Nature, 1970.
[4] G. Miller et al., "Cognitive Science," Science, 1981.
[5] T. Kohonen et al., Self-Organization and Associative Memory, 1988.