A family of algorithms for approximate Bayesian inference

One of the major obstacles to using Bayesian methods for pattern recognition has been their computational expense. This thesis presents an approximation technique that can perform Bayesian inference faster and more accurately than previously possible. This method, "Expectation Propagation," unifies and generalizes two previous techniques: assumed-density filtering, an extension of the Kalman filter, and loopy belief propagation, an extension of belief propagation in Bayesian networks. The unification shows how both of these algorithms can be viewed as approximating the true posterior distribution with a simpler distribution that is close in the sense of KL-divergence. Expectation Propagation exploits the best of both algorithms: the generality of assumed-density filtering and the accuracy of loopy belief propagation. Loopy belief propagation, because it propagates exact belief states, is useful only for a limited class of belief networks, such as purely discrete networks. Expectation Propagation approximates the belief states with expectations, such as means and variances, giving it much wider scope. Expectation Propagation also extends belief propagation in the opposite direction, propagating richer belief states that incorporate correlations between variables. This framework is demonstrated on a variety of statistical models using synthetic and real-world data. On Gaussian mixture problems, Expectation Propagation is found, for the same amount of computation, to be convincingly better than rival approximation techniques: Monte Carlo, Laplace's method, and variational Bayes. For pattern recognition, Expectation Propagation provides an algorithm for training Bayes Point Machine classifiers that is faster and more accurate than any previously known. The resulting classifiers outperform Support Vector Machines on several standard datasets while having comparable training time. Expectation Propagation can also be used to choose an appropriate feature set for classification, via Bayesian model selection.
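
To make the cavity/moment-matching idea concrete, here is a minimal sketch of a generic EP loop on a one-dimensional clutter-style Gaussian mixture problem of the kind mentioned above: a Gaussian prior on an unknown mean, with observations contaminated by a broad zero-centered background Gaussian. The posterior over the mean is approximated by a single Gaussian, each likelihood term by a Gaussian "site"; each update removes one site, moment-matches against the exact factor, and re-estimates that site. The parameter names (w, a, prior_var) and settings are illustrative assumptions, not values taken from the thesis.

```python
import numpy as np

def gauss(y, m, v):
    """Density of N(y; m, v)."""
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

def ep_clutter(y, w=0.5, a=10.0, prior_var=100.0, iters=20):
    """EP for x ~ N(0, prior_var), y_i ~ (1-w) N(x, 1) + w N(0, a)."""
    n = len(y)
    # Site approximations t_i(x) ~ N(x; m_i, v_i), stored as natural parameters.
    site_prec = np.zeros(n)          # 1 / v_i  (zero = uninformative site)
    site_prec_mean = np.zeros(n)     # m_i / v_i
    # Global approximation q(x) = prior * prod_i t_i(x), also in natural parameters.
    prec = 1.0 / prior_var + site_prec.sum()
    prec_mean = 0.0 + site_prec_mean.sum()
    for _ in range(iters):
        for i in range(n):
            # 1. Remove site i from q to get the cavity distribution.
            cav_prec = prec - site_prec[i]
            cav_prec_mean = prec_mean - site_prec_mean[i]
            cav_v = 1.0 / cav_prec
            cav_m = cav_prec_mean * cav_v
            # 2. Combine the cavity with the exact mixture factor and match moments.
            z_sig = (1 - w) * gauss(y[i], cav_m, cav_v + 1.0)   # signal component
            z_clu = w * gauss(y[i], 0.0, a)                     # clutter component
            r = z_sig / (z_sig + z_clu)
            d = cav_v * (y[i] - cav_m) / (cav_v + 1.0)
            new_m = cav_m + r * d
            new_v = cav_v - r * cav_v ** 2 / (cav_v + 1.0) + r * (1 - r) * d ** 2
            # 3. Update site i so that (cavity * site) has the matched moments.
            new_prec = 1.0 / new_v
            new_prec_mean = new_m / new_v
            site_prec[i] = new_prec - cav_prec
            site_prec_mean[i] = new_prec_mean - cav_prec_mean
            prec, prec_mean = new_prec, new_prec_mean
    return prec_mean / prec, 1.0 / prec   # approximate posterior mean and variance

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x_true = 2.0
    clutter = rng.random(50) < 0.25
    y = np.where(clutter, rng.normal(0.0, np.sqrt(10.0), 50),
                 rng.normal(x_true, 1.0, 50))
    print(ep_clutter(y, w=0.25))
```

Note that EP site precisions can legitimately become negative, and a more careful implementation would guard against negative cavity variances and monitor convergence; this sketch omits both for brevity.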
