Ensemble Learning for Multi-Layer Networks

Bayesian treatments of learning in neural networks are typically based either on local Gaussian approximations to a mode of the posterior weight distribution, or on Markov chain Monte Carlo simulations. A third approach, called ensemble learning, was introduced by Hinton and van Camp (1993). It aims to approximate the posterior distribution by minimizing the Kullback-Leibler divergence between the true posterior and a parametric approximating distribution. However, the derivation of a deterministic algorithm relied on the use of a Gaussian approximating distribution with a diagonal covariance matrix and so was unable to capture the posterior correlations between parameters. In this paper, we show how the ensemble learning approach can be extended to full-covariance Gaussian distributions while remaining computationally tractable. We also extend the framework to deal with hyperparameters, leading to a simple re-estimation procedure. Initial results from a standard benchmark problem are encouraging.
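
To make the objective concrete, here is a minimal sketch in standard variational notation (the symbols Q, P, w, and D are chosen here for illustration and are not taken from the paper). Writing Q(w) for the approximating distribution over the weight vector w and P(w | D) for the true posterior given the data D, ensemble learning minimizes

\mathrm{KL}\left[\,Q \,\|\, P\,\right]
  = \int Q(\mathbf{w}) \ln \frac{Q(\mathbf{w})}{P(\mathbf{w}\mid D)}\, d\mathbf{w}
  = \ln P(D) \;-\; \int Q(\mathbf{w}) \ln \frac{P(D\mid\mathbf{w})\, P(\mathbf{w})}{Q(\mathbf{w})}\, d\mathbf{w},

so minimizing the divergence is equivalent to maximizing the final integral, a lower bound on the log evidence ln P(D) that can be evaluated without the intractable normalizing constant. If Q is chosen to be a Gaussian with mean \boldsymbol{\mu} and full covariance \boldsymbol{\Sigma}, its entropy contribution to this bound has the closed form \tfrac{1}{2}\ln\det(2\pi e\,\boldsymbol{\Sigma}); the remaining expectations of the log prior and log likelihood under Q are then the quantities that must be computed or bounded for such a scheme to stay tractable.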

[1] J. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes, 1906.

[2] H. Jeffreys. An invariant form for the prior probability in estimation problems, 1946, Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences.

[3] N. Metropolis et al. Equation of State Calculations by Fast Computing Machines, 1953, Resonance.

[4] H. Rauch. Solutions to the linear smoothing problem, 1963.

[5] C. Striebel et al. On the maximum likelihood estimates for linear dynamic systems, 1965.

[6] Andrew J. Viterbi et al. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, 1967, IEEE Trans. Inf. Theory.

[7] O. Seeberg. Statistical Mechanics: A Set of Lectures, 1975.

[8] D. Rubin et al. Maximum likelihood from incomplete data via the EM algorithm (with discussion), 1977.

[9] G. Torrie et al. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling, 1977.

[10] G. Schwarz. Estimating the Dimension of a Model, 1978.

[11] Temple F. Smith. Occam's razor, 1980, Nature.

[12] R. Shumway et al. An approach to time series smoothing and forecasting using the EM algorithm, 1982.

[13] C. D. Gelatt et al. Optimization by Simulated Annealing, 1983, Science.

[14] C. S. Wallace et al. Estimation and Inference by Compact Coding, 1987.

[15] David J. Spiegelhalter et al. Local computations with probabilities on graphical structures and their application to expert systems, 1990.

[16] Judea Pearl et al. Probabilistic reasoning in intelligent systems: networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.

[17] R. T. Cox. Probability, frequency and reasonable expectation, 1990.

[18] A. O'Hagan et al. Bayes–Hermite quadrature, 1991.

[19] Geoffrey E. Hinton et al. Mean field networks that learn to discriminate temporally distorted strings, 1991.

[20] Biing-Hwang Juang et al. Hidden Markov Models for Speech Recognition, 1991.

[21] James O. Berger et al. Ockham's Razor and Bayesian Analysis, 1992.

[22] Andreas Stolcke et al. Hidden Markov Model Induction by Bayesian Model Merging, 1992, NIPS.

[23] David J. C. MacKay et al. Bayesian Interpolation, 1992, Neural Computation.

[24] Radford M. Neal. Connectionist Learning of Belief Networks, 1992, Artif. Intell.

[25] David J. C. MacKay et al. A Practical Bayesian Framework for Backpropagation Networks, 1992, Neural Computation.

[26] C. Robert et al. Bayesian estimation of hidden Markov chains: a stochastic implementation, 1993.

[27] Geoffrey E. Hinton et al. Keeping Neural Networks Simple, 1993.

[28] Geoffrey E. Hinton et al. Keeping the neural networks simple by minimizing the description length of the weights, 1993, COLT '93.

[29] Heekuck Oh et al. Neural Networks for Pattern Recognition, 1993, Adv. Comput.

[30] Jonathan J. Hull et al. A Database for Handwritten Text Recognition Research, 1994, IEEE Trans. Pattern Anal. Mach. Intell.

[31] David J. C. MacKay et al. A hierarchical Dirichlet language model, 1995, Natural Language Engineering.

[32] David MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks, 1995.

[33] Steve R. Waterhouse et al. Bayesian Methods for Mixtures of Experts, 1995, NIPS.

[34] Geoffrey E. Hinton et al. Bayesian Learning for Neural Networks, 1995.

[35] David J. C. MacKay et al. Developments in Probabilistic Modelling with Neural Networks - Ensemble Learning, 1995, SNN Symposium on Neural Networks.

[36] Michael I. Jordan et al. Mean Field Theory for Sigmoid Belief Networks, 1996, J. Artif. Intell. Res.

[37] Peter C. Cheeseman et al. Bayesian Classification (AutoClass): Theory and Results, 1996, Advances in Knowledge Discovery and Data Mining.

[38] Michael I. Jordan et al. Hidden Markov Decision Trees, 1996, NIPS.

[39] David Bruce Wilson et al. Exact sampling with coupled Markov chains and applications to statistical mechanics, 1996, Random Struct. Algorithms.

[40] James Allen Fill et al. An interruptible algorithm for perfect sampling via Markov chains, 1997, STOC '97.

[41] Michael I. Jordan et al. Variational methods for inference and estimation in graphical models, 1997.

[42] Neil D. Lawrence et al. Approximating Posterior Distributions in Belief Networks Using Mixtures, 1997, NIPS.

[43] P. Green et al. Corrigendum: On Bayesian analysis of mixtures with an unknown number of components, 1997.

[44] P. Saama. Maximum likelihood and Bayesian methods for mixtures of normal distributions, 1997.

[45] David Barber et al. On Computing the KL Divergence for Bayesian Neural Networks, 1997.

[46] Michael I. Jordan et al. Probabilistic Independence Networks for Hidden Markov Probability Models, 1997, Neural Computation.

[47] C. Cruz et al. Improving the Mean Field Approximation via the Use of Mixture Distributions, 1998.

[48] Christopher K. I. Williams et al. DTs: Dynamic Trees, 1998, NIPS.

[49] Jim Q. Smith et al. On the Geometry of Bayesian Graphical Models with Hidden Variables, 1998, UAI.

[50] Nir Friedman et al. The Bayesian Structural EM Algorithm, 1998, UAI.

[51] Yoshua Bengio et al. Convolutional networks for images, speech, and time series, 1998.

[52] Ross D. Shachter. Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams), 1998, UAI.

[53] Radford M. Neal. Assessing relevance determination methods using DELVE, 1998.

[54] Xavier Boyen et al. Tractable Inference for Complex Stochastic Processes, 1998, UAI.

[55] Geoffrey E. Hinton et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, 1998, Learning in Graphical Models.

[56] Neil D. Lawrence et al. Mixture Representations for Inference and Learning in Boltzmann Machines, 1998, UAI.

[57] P. Green et al. Exact Sampling from a Continuous State Space, 1998.

[58] William D. Penny et al. Bayesian Approaches to Gaussian Mixture Modeling, 1998, IEEE Trans. Pattern Anal. Mach. Intell.

[59] Christopher M. Bishop et al. Mixtures of Probabilistic Principal Component Analyzers, 1999, Neural Computation.

[60] G. Casella et al. Perfect Slice Samplers for Mixtures of Distributions, 1999.

[61] Zoubin Ghahramani et al. A Unifying Review of Linear Gaussian Models, 1999, Neural Computation.

[62] Harri Lappalainen. Ensemble learning for independent component analysis, 1999.

[63] Carl E. Rasmussen et al. The Infinite Gaussian Mixture Model, 1999, NIPS.

[64] David J. Spiegelhalter et al. Probabilistic Networks and Expert Systems, 1999, Information Science and Statistics.

[65] David J. C. MacKay et al. Comparison of Approximate Methods for Handling Hyperparameters, 1999, Neural Computation.

[66] Neil D. Lawrence et al. A Variational Bayesian Committee of Neural Networks, 1999.