Using fast weights to improve persistent contrastive divergence

The most commonly used learning algorithm for restricted Boltzmann machines is contrastive divergence, which starts a Markov chain at a data point and runs the chain for only a few iterations to get a cheap, low-variance estimate of the sufficient statistics under the model. Tieleman (2008) showed that better learning can be achieved by estimating the model's statistics using a small set of persistent "fantasy particles" that are not reinitialized to data points after each weight update. With sufficiently small weight updates, the fantasy particles represent the equilibrium distribution accurately, but to explain why the method works with much larger weight updates it is necessary to consider the interaction between the weight updates and the Markov chain. We show that the weight updates force the Markov chain to mix fast, and using this insight we develop an even faster-mixing chain that uses an auxiliary set of "fast weights" to implement a temporary overlay on the energy landscape. The fast weights learn rapidly but also decay rapidly, and they do not contribute to the normal energy landscape that defines the model.
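
To illustrate the idea, here is a minimal sketch of a persistent contrastive divergence update augmented with fast weights for a binary RBM. The sizes, learning rates, and decay factor are illustrative assumptions rather than the paper's exact settings; the key structure is that the slow weights receive the usual likelihood-gradient estimate, while the fast weights receive the same gradient with a larger learning rate, decay towards zero, and are added to the slow weights only when the persistent chains are sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes and hyperparameters (assumptions, not the paper's settings).
n_visible, n_hidden, n_chains = 784, 500, 100
lr_slow, lr_fast, fast_decay = 0.001, 0.01, 0.95

W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # slow weights: define the model
W_fast = np.zeros_like(W)                               # fast weights: temporary overlay
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)      # biases (updates omitted for brevity)

# Persistent "fantasy particles": visible states of the negative-phase chains.
fantasy_v = rng.integers(0, 2, size=(n_chains, n_visible)).astype(float)

def fpcd_update(v_data):
    """One weight update from a mini-batch v_data of shape (batch, n_visible)."""
    global W, W_fast, fantasy_v

    # Positive phase: hidden activation probabilities given the data (slow weights only).
    h_data = sigmoid(v_data @ W + b_h)

    # Negative phase: one Gibbs step on the persistent chains, sampled from the
    # overlaid energy landscape defined by the sum of slow and fast weights.
    W_eff = W + W_fast
    h_fant = (sigmoid(fantasy_v @ W_eff + b_h) > rng.random((n_chains, n_hidden))).astype(float)
    fantasy_v = (sigmoid(h_fant @ W_eff.T + b_v) > rng.random((n_chains, n_visible))).astype(float)
    h_fant_p = sigmoid(fantasy_v @ W_eff + b_h)

    # Approximate sufficient statistics.
    pos = v_data.T @ h_data / len(v_data)
    neg = fantasy_v.T @ h_fant_p / n_chains

    # Slow weights follow the usual gradient estimate; fast weights learn the same
    # signal more quickly but decay towards zero, so they only push the chains out
    # of the modes they currently occupy and do not alter the model itself.
    W += lr_slow * (pos - neg)
    W_fast = fast_decay * W_fast + lr_fast * (pos - neg)
```

In this sketch the negative-phase statistics lower the overlaid weights wherever the fantasy particles currently have high activity, which raises the energy of those configurations and keeps the chains moving even when the slow weights change little between updates.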

[1] H. Robbins. A Stochastic Approximation Method, 1951.

[2] Geoffrey E. Hinton, et al. A New Learning Algorithm for Mean Field Boltzmann Machines, 2002, ICANN.

[3] Vivek S. Borkar, et al. Stochastic approximation algorithms: Overview and recent trends, 1999.

[4] Cristian Sminchisescu, et al. Generalized Darting Monte Carlo, 2007, AISTATS.

[5] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.

[6] Michael I. Jordan, et al. An Introduction to Variational Methods for Graphical Models, 1999, Machine Learning.

[7] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[8] Yee Whye Teh, et al. A Fast Learning Algorithm for Deep Belief Nets, 2006, Neural Computation.

[9] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, 1998, Learning in Graphical Models.

[10] D. Cvijović, et al. Taboo search: an approach to the multiple minima problem, 1995, Science.

[11] Paul Smolensky. Information processing in dynamical systems: foundations of harmony theory, 1986.

[12] Tijmen Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient, 2008, ICML '08.

[13] Jacek Klinowski, et al. Taboo Search: An Approach to the Multiple Minima Problem, 1995, Science.

[14] Radford M. Neal. Connectionist Learning of Belief Networks, 1992, Artificial Intelligence.

[15] Y. LeCun, et al. Learning methods for generic object recognition with invariance to pose and lighting, 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004).

[16] Yoshua Bengio, et al. Classification using discriminative restricted Boltzmann machines, 2008, ICML '08.

[17] Gang George Yin, et al. Stochastic approximation algorithms for trailing stop, 2008, 47th IEEE Conference on Decision and Control.