Training Products of Experts by Minimizing Contrastive Divergence

It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual expert models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called contrastive divergence whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
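For concreteness, the combination rule and the contrastive divergence objective can be summarized as follows. This is a sketch of the formulation rather than a verbatim reproduction: Q^0 denotes the data distribution, Q^1 the distribution obtained after one full step of Gibbs sampling from the data, and Q^infinity the model's equilibrium distribution.

```latex
% Product of n experts, each with parameters \theta_m, over data vector d:
p(d \mid \theta_1, \ldots, \theta_n)
  = \frac{\prod_m p_m(d \mid \theta_m)}{\sum_c \prod_m p_m(c \mid \theta_m)}

% Contrastive divergence objective:
\mathrm{CD} = \mathrm{KL}\!\left(Q^0 \,\|\, Q^\infty\right)
            - \mathrm{KL}\!\left(Q^1 \,\|\, Q^\infty\right)

% Approximate gradient used for learning:
\Delta\theta_m \propto
  \left\langle \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^0}
  - \left\langle \frac{\partial \log p_m(d \mid \theta_m)}{\partial \theta_m} \right\rangle_{Q^1}
```

As a minimal illustration of the resulting learning rule, the sketch below assumes the special case in which each expert is a hidden unit of a binary restricted Boltzmann machine and a single Gibbs step (CD-1) supplies the negative-phase statistics. Function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_vis, b_hid, data, lr=0.1):
    """One CD-1 parameter update for a binary RBM (a PoE whose experts
    are the hidden units). Shapes: data (n, n_vis), W (n_vis, n_hid)."""
    # Positive phase: hidden probabilities given the data.
    h_prob = sigmoid(data @ W + b_hid)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # One full Gibbs step: reconstruct visibles, then recompute hiddens.
    v_prob = sigmoid(h_sample @ W.T + b_vis)
    h_prob_recon = sigmoid(v_prob @ W + b_hid)
    # Difference of pairwise statistics approximates the CD gradient;
    # no renormalization term is ever computed.
    pos = data.T @ h_prob
    neg = v_prob.T @ h_prob_recon
    n = data.shape[0]
    W += lr * (pos - neg) / n
    b_vis += lr * (data - v_prob).mean(axis=0)
    b_hid += lr * (h_prob - h_prob_recon).mean(axis=0)
    return W, b_vis, b_hid

# Hypothetical usage on binary data of dimension 784 with 64 experts:
# W = 0.01 * rng.standard_normal((784, 64))
# b_vis, b_hid = np.zeros(784), np.zeros(64)
# cd1_update(W, b_vis, b_hid, batch)  # batch: (n, 784) array of 0/1 values
```

The positive statistics come from the data and the negative statistics from the one-step reconstruction; their difference stands in for the intractable derivative of the renormalization term.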
