Bethe free energy and contrastive divergence approximations for undirected graphical models

As the machine learning community tackles larger and harder problems, the graphical models needed to solve them grow larger and more complicated. As a result, exact inference and learning in such models becomes ever more expensive, and approximate techniques become ever more prominent. Among the many approximation schemes in the literature, this thesis contributes new ideas to two: the products of experts (PoEs) class of models (Hinton, 2002) and Bethe free energy approximations (Yedidia et al., 2001).

For PoEs, our contribution is the development of new PoE models for continuous-valued domains. We developed RBMrate, a model for discretized continuous-valued data, and applied it to face recognition to demonstrate its abilities. We also developed energy-based models (EBMs): flexible probabilistic models whose building blocks are energy terms computed by a feed-forward network. We show that standard square noiseless independent components analysis (ICA) (Bell and Sejnowski, 1995) can be viewed as a restricted form of EBM. Extending this relationship with ICA, we describe sparse and over-complete representations of data for which inference is trivial, since the model is simply an EBM.

For Bethe free energy approximations, our contribution is a theory relating belief propagation and iterative scaling. We show that both belief propagation and iterative scaling updates can be derived as fixed-point equations for constrained minimization of the Bethe free energy. This allows us to develop a new algorithm that minimizes the Bethe free energy directly, and to apply the Bethe free energy to learning as well as inference. We also describe improvements to the efficiency of standard learning algorithms for undirected graphical models (Jirousek and Preucil, 1995).
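The PoE models above are all trained by contrastive divergence (Hinton, 2002). As a concrete illustration of the learning rule, the following is a minimal sketch of a CD-1 update for a binary restricted Boltzmann machine, the simplest PoE. It assumes NumPy; the function and variable names are ours, for illustration only, and are not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One contrastive divergence (CD-1) step for a binary RBM.

    v0 : (batch, n_vis) binary data; W : (n_vis, n_hid) weights;
    b : (n_vis,) visible biases; c : (n_hid,) hidden biases.
    """
    # Positive phase: hidden activations driven by the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: a single Gibbs step (the "1" in CD-1), used in
    # place of the equilibrium samples exact maximum likelihood needs.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Gradient estimate: data statistics minus reconstruction statistics.
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```

The point of contrastive divergence is visible in the last three lines: the intractable model expectation in the maximum likelihood gradient is replaced by statistics gathered from a one-step reconstruction of the data.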
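On the inference side, the central object is easy to state. For a pairwise Markov random field with node potentials \(\psi_i\), edge potentials \(\psi_{ij}\), node degrees \(d_i\), and approximate marginals ("beliefs") \(b_i\) and \(b_{ij}\), the Bethe free energy in the form given by Yedidia et al. (2001) is

\[
F_{\mathrm{Bethe}} = \sum_{(ij)} \sum_{x_i, x_j} b_{ij}(x_i, x_j)
  \ln \frac{b_{ij}(x_i, x_j)}{\psi_{ij}(x_i, x_j)\, \psi_i(x_i)\, \psi_j(x_j)}
  \;-\; \sum_i (d_i - 1) \sum_{x_i} b_i(x_i) \ln \frac{b_i(x_i)}{\psi_i(x_i)},
\]

minimized subject to normalization and the marginalization constraints \(\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)\). Fixed points of loopy belief propagation are stationary points of this constrained problem (Yedidia et al., 2001), and the thesis shows that iterative scaling updates arise from the same constrained minimization.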
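The learning-side counterpart, iterative scaling, has an equally compact core: iterative proportional fitting (IPF) repeatedly rescales a joint distribution so that selected marginals match their targets (Deming and Stephan's 1940 procedure, whose efficient implementation is the subject of Jirousek and Preucil, 1995). A self-contained toy example, again only an illustrative sketch:

```python
import numpy as np

# Fit a 2x2 joint distribution to fixed row and column marginals by IPF.
P = np.full((2, 2), 0.25)          # initial joint (uniform)
row_target = np.array([0.7, 0.3])  # target marginal of the first variable
col_target = np.array([0.6, 0.4])  # target marginal of the second variable

for _ in range(50):
    P *= (row_target / P.sum(axis=1))[:, None]  # scale rows to match
    P *= (col_target / P.sum(axis=0))[None, :]  # scale columns to match

# After convergence, P has the target row and column marginals.
print(P.sum(axis=1), P.sum(axis=0))
```

This marginal-matching update is the iterative scaling step that the thesis relates, via fixed-point equations, to constrained minimization of the Bethe free energy.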

[1] W. Deming, et al. On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known, 1940.

[2] J. Darroch, et al. Generalized Iterative Scaling for Log-Linear Models, 1972.

[3] R. Palmer, et al. Solution of 'Solvable model of a spin glass', 1977.

[4] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm, plus discussions on the paper, 1977.

[5] T. Plefka. Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model, 1982.

[6] C. D. Gelatt, et al. Optimization by Simulated Annealing, 1983, Science.

[7] Geoffrey E. Hinton, et al. Learning internal representations by error propagation, 1986.

[8] Paul Smolensky. Information processing in dynamical systems: foundations of harmony theory, 1986.

[9] Geoffrey E. Hinton, et al. Learning and relearning in Boltzmann machines, 1986.

[10] David J. Spiegelhalter, et al. Local computations with probabilities on graphical structures and their application to expert systems, 1990.

[11] Prakash P. Shenoy, et al. Axioms for probability and belief-function propagation, 1990, UAI.

[12] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.

[13] Paul J. Werbos. Backpropagation Through Time: What It Does and How to Do It, 1990, Proc. IEEE.

[14] Prakash P. Shenoy, et al. Probability propagation, 1990, Annals of Mathematics and Artificial Intelligence.

[15] C. Geyer. Markov Chain Monte Carlo Maximum Likelihood, 1991.

[16] J. Yedidia, et al. How to expand around mean-field theory using high-temperature expansions, 1991.

[17] M. Turk, et al. Eigenfaces for Recognition, 1991, Journal of Cognitive Neuroscience.

[18] Gregory F. Cooper, et al. A Bayesian Method for the Induction of Probabilistic Networks from Data, 1992.

[19] Edward H. Adelson, et al. Shiftable multiscale transforms, 1992, IEEE Trans. Inf. Theory.

[21] G. Parisi, et al. Simulated tempering: a new Monte Carlo scheme, 1992, hep-lat/9205018.

[22] Stéphane Mallat, et al. Matching pursuits with time-frequency dictionaries, 1993, IEEE Trans. Signal Process.

[23] Lee. New Monte Carlo algorithm: Entropic sampling, 1993, Physical Review Letters.

[24] Pierre Comon. Independent component analysis, a new concept?, 1994, Signal Process.

[25] Wai Lam, et al. Learning Bayesian belief networks: an approach based on the MDL principle, 1994, Comput. Intell.

[26] R. Jirousek, et al. On the effective implementation of the iterative proportional fitting procedure, 1995.

[27] Terrence J. Sejnowski, et al. An Information-Maximization Approach to Blind Separation and Blind Deconvolution, 1995, Neural Computation.

[28] Geoffrey E. Hinton, et al. The Helmholtz Machine, 1995, Neural Computation.

[29] Andrzej Cichocki, et al. A New Learning Algorithm for Blind Signal Separation, 1995, NIPS.

[30] Radford M. Neal, et al. Near Shannon limit performance of low density parity check codes, 1996.

[31] Radford M. Neal. Sampling from multimodal distributions using tempered transitions, 1996, Stat. Comput.

[32] Barak A. Pearlmutter, et al. A Context-Sensitive Generalization of ICA, 1996.

[33] David J. Field, et al. Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996, Nature.

[34] David J. Kriegman, et al. Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, 1996, ECCV.

[35] Steffen L. Lauritzen. Graphical Models, 1996.

[36] J. Propp, et al. Exact sampling with coupled Markov chains and applications to statistical mechanics, 1996.

[37] Adam L. Berger, et al. A Maximum Entropy Approach to Natural Language Processing, 1996, CL.

[38] Hyeonjoon Moon, et al. The FERET September 1996 Database and Evaluation Procedure, 1997, AVBPA.

[39] David J. Field, et al. Sparse coding with an overcomplete basis set: A strategy employed by V1?, 1997, Vision Research.

[40] Sylvia Richardson, et al. Markov Chain Monte Carlo in Practice, 1997.

[41] Edward H. Adelson, et al. Belief Propagation and Revision in Networks with Loops, 1997.

[42] Alex Pentland, et al. Probabilistic Visual Learning for Object Representation, 1997, IEEE Trans. Pattern Anal. Mach. Intell.

[43] J. Cardoso. Infomax and maximum likelihood for blind source separation, 1997, IEEE Signal Processing Letters.

[44] Brendan J. Frey, et al. A Revolution: Belief Propagation in Graphs with Cycles, 1997, NIPS.

[45] John D. Lafferty, et al. Inducing Features of Random Fields, 1995, IEEE Trans. Pattern Anal. Mach. Intell.

[47] Song-Chun Zhu, et al. Minimax Entropy Principle and Its Application to Texture Modeling, 1997, Neural Computation.

[48] Robert Cowell. Introduction to Inference for Bayesian Networks, 1998, Learning in Graphical Models.

[49] Jung-Fu Cheng, et al. Turbo Decoding as an Instance of Pearl's "Belief Propagation" Algorithm, 1998, IEEE J. Sel. Areas Commun.

[50] J. H. van Hateren, et al. Independent component filters of natural images compared with simple cells in primary visual cortex, 1998.

[51] Brian Sallans, et al. A Hierarchical Community of Experts, 1999, Learning in Graphical Models.

[52] Nir Friedman. The Bayesian Structural EM Algorithm, 1998, UAI.

[53] Michael I. Jordan. Learning in Graphical Models, 1999, NATO ASI Series.

[54] Hilbert J. Kappen, et al. Efficient Learning in Boltzmann Machines Using Linear Response Theory, 1998, Neural Computation.

[55] Michael A. Saunders, et al. Atomic Decomposition by Basis Pursuit, 1998, SIAM J. Sci. Comput.

[56] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, 1998, Learning in Graphical Models.

[57] Alex Pentland, et al. Beyond eigenfaces: probabilistic matching for face recognition, 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[58] Michael I. Jordan, et al. An Introduction to Variational Methods for Graphical Models, 1999, Machine Learning.

[59] William T. Freeman, et al. Learning to Estimate Scenes from Images, 1998, NIPS.

[60] Michael I. Jordan, et al. Loopy Belief Propagation for Approximate Inference: An Empirical Study, 1999, UAI.

[61] K. Jarrod Millman, et al. Learning Sparse Codes with a Mixture-of-Gaussians Prior, 1999, NIPS.

[62] Hagai Attias. Independent Factor Analysis, 1999, Neural Computation.

[63] H. Sebastian Seung, et al. Learning the parts of objects by non-negative matrix factorization, 1999, Nature.

[64] Andrew McCallum, et al. Using Maximum Entropy for Text Classification, 1999.

[65] Bruno A. Olshausen, et al. Probabilistic framework for the adaptation and comparison of image codes, 1999.

[66] Zoubin Ghahramani, et al. Variational Inference for Bayesian Mixtures of Factor Analysers, 1999, NIPS.

[67] Michael E. Tipping, et al. Probabilistic Principal Component Analysis, 1999.

[68] Geoffrey E. Hinton, et al. Recognizing Hand-written Digits Using Hierarchical Products of Experts, 2002, NIPS.

[69] Yair Weiss. Correctness of Local Probability Propagation in Graphical Models with Loops, 2000, Neural Computation.

[70] Geoffrey E. Hinton, et al. Using Free Energies to Represent Q-values in a Multiagent Reinforcement Learning Task, 2000, NIPS.

[71] Brendan J. Frey, et al. Accumulator Networks: Suitors of Local Probability Propagation, 2000, NIPS.

[72] Terrence J. Sejnowski, et al. Learning Overcomplete Representations, 2000, Neural Computation.

[73] Shun-ichi Amari, et al. Methods of information geometry, 2000.

[74] Yee Whye Teh, et al. Rate-coded Restricted Boltzmann Machines for Face Recognition, 2000, NIPS.

[75] J. Yedidia. An Idiosyncratic Journey Beyond Mean Field Theory, 2000.

[76] W. Freeman, et al. Generalized Belief Propagation, 2000, NIPS.

[77] Daniel D. Lee, et al. An Information Maximization Approach to Overcomplete and Recurrent Representations, 2000, NIPS.

[78] Brendan J. Frey, et al. Sequentially Fitting "Inclusive" Trees for Inference in Noisy-OR Networks, 2000, NIPS.

[79] Zoubin Ghahramani, et al. Propagation Algorithms for Variational Bayesian Learning, 2000, NIPS.

[80] S. Aji, et al. The Generalized Distributive Law and Free Energy Minimization, 2001.

[81] M. Opper, et al. From Naive Mean Field Theory to the TAP Equations, 2001.

[82] Radford M. Neal. Annealed importance sampling, 1998, Stat. Comput.

[83] William T. Freeman, et al. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs, 2001, IEEE Trans. Inf. Theory.

[84] D. MacKay. Failures of the One-Step Learning Algorithm, 2001.

[85] Yee Whye Teh, et al. Discovering Multiple Constraints that are Frequently Approximately Satisfied, 2001, UAI.

[86] Mark A. Girolami. A Variational Method for Learning Sparse and Overcomplete Representations, 2001, Neural Computation.

[87] Y. Teh, et al. Passing and Bouncing Messages for Generalized Inference, GCNU TR 2001-001, 2001.

[88] Aapo Hyvärinen, et al. A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images, 2001, Vision Research.

[89] Yee Whye Teh, et al. Belief Optimization for Binary Networks: A Stable Alternative to Loopy Belief Propagation, 2001, UAI.

[90] Hilbert J. Kappen, et al. Mean field theory for graphical models, 2001.

[91] M. Opper, et al. Advanced mean field methods: theory and practice, 2001.

[92] Geoffrey E. Hinton, et al. Products of Hidden Markov Models, 2001, AISTATS.

[93] Yee Whye Teh, et al. The Unified Propagation and Scaling Algorithm, 2001, NIPS.

[94] Quaid Morris. Recognition Networks for Approximate Inference in BN20 Networks, 2001, UAI.

[95] Mark D. Plumbley. If the independent components of natural images are edges, what are the independent components of natural sounds?, 2001.

[96] Tom Minka. A family of algorithms for approximate Bayesian inference, 2001.

[97] Martin J. Wainwright, et al. Tree-based reparameterization for approximate inference on loopy graphs, 2001, NIPS.

[98] Yee Whye Teh, et al. A New View of ICA, 2001.

[99] Michael I. Jordan, et al. Thin Junction Trees, 2001, NIPS.

[100] M. Opper, et al. Information Geometry of Mean Field Approximation, 2001.

[101] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[102] P. McCullagh, et al. Discussion of paper by Peter McCullagh, 2002.

[103] Tom Heskes. Stable Fixed Points of Loopy Belief Propagation Are Local Minima of the Bethe Free Energy, 2002, NIPS.

[104] Sekhar Tatikonda, et al. Loopy Belief Propagation and Gibbs Measures, 2002, UAI.

[105] Geoffrey E. Hinton, et al. Learning Sparse Topographic Representations with Products of Student-t Distributions, 2002, NIPS.

[106] Martin J. Wainwright. Stochastic processes on graphs with cycles: geometric and variational approaches, 2002.

[107] Marian Stewart Bartlett, et al. Face recognition by independent component analysis, 2002, IEEE Trans. Neural Networks.

[108] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[109] Alan L. Yuille. CCCP Algorithms to Minimize the Bethe and Kikuchi Free Energies: Convergent Alternatives to Belief Propagation, 2002, Neural Computation.

[110] T. Heskes. Stable Fixed Points of Loopy Belief Propagation Are Minima of the Bethe Free Energy, 2002.

[111] Christopher K. I. Williams, et al. An analysis of contrastive divergence learning in Gaussian Boltzmann machines, 2002.

[112] Yee Whye Teh, et al. On Improving the Efficiency of the Iterative Proportional Fitting Procedure, 2003, AISTATS.

[113] Adam Berger. The Improved Iterative Scaling Algorithm: A Gentle Introduction, 2003.

[114] Yee Whye Teh, et al. Approximate inference in Boltzmann machines, 2003, Artif. Intell.

[115] Matthew J. Beal, et al. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures, 2003.

[116] Yee Whye Teh, et al. Energy-Based Models for Sparse Overcomplete Representations, 2003, J. Mach. Learn. Res.

[117] Jitendra Malik, et al. Contour and Texture Analysis for Image Segmentation, 2001, International Journal of Computer Vision.

[118] Aapo Hyvärinen, et al. Estimating Overcomplete Independent Component Bases for Image Windows, 2002, Journal of Mathematical Imaging and Vision.

[119] William T. Freeman, et al. Constructing free-energy approximations and generalized belief propagation algorithms, 2005, IEEE Transactions on Information Theory.

[120] Martin J. Wainwright, et al. A new class of upper bounds on the log partition function, 2002, IEEE Transactions on Information Theory.