Bayesian Methods for Adaptive Models

The Bayesian framework for model comparison and regularisation is demonstrated by studying interpolation and classification problems modelled with both linear and non-linear models. This framework quantitatively embodies ‘Occam’s razor’. Over-complex and under-regularised models are automatically inferred to be less probable, even though their flexibility allows them to fit the data better. When applied to ‘neural networks’, the Bayesian framework makes possible (1) objective comparison of solutions using alternative network architectures; (2) objective stopping rules for network pruning or growing procedures; (3) objective choice of type of weight decay terms (or regularisers); (4) on-line techniques for optimising weight decay (or regularisation constant) magnitude; (5) a measure of the effective number of well-determined parameters in a model; (6) quantified estimates of the error bars on network parameters and on network output. In the case of classification models, it is shown that the careful incorporation of error bar information into a classifier’s predictions yields improved performance. Comparisons of the inferences of the Bayesian framework with more traditional cross-validation methods help detect poor underlying assumptions in learning models. The relationship of the Bayesian learning framework to ‘active learning’ is examined. Objective functions are discussed which measure the expected informativeness of candidate data measurements, in the context of both interpolation and classification problems. The concepts and methods described in this thesis are quite general and will be applicable to other data modelling problems, whether they involve regression, classification or density estimation.
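As a concrete illustration of items (4) and (5) above, the sketch below implements the standard evidence-based re-estimation of a single weight-decay constant α and noise precision β for a linear-in-the-parameters model. This is a minimal sketch, not the thesis's own code: the function and variable names (`evidence_framework`, `Phi`, `alpha`, `beta`) are illustrative, and a full neural-network treatment would replace Φᵀ Φ with a Gaussian approximation to the Hessian of the error function at a weight optimum.

```python
import numpy as np

def evidence_framework(Phi, t, alpha=1.0, beta=1.0, n_iter=50):
    """Re-estimate a weight-decay constant (alpha) and noise precision (beta)
    for the linear model t ~ N(Phi @ w, 1/beta) with prior w ~ N(0, 1/alpha),
    by maximising the evidence (illustrative sketch of the evidence framework)."""
    N, k = Phi.shape
    PhiT_Phi = Phi.T @ Phi
    data_eigs = np.linalg.eigvalsh(PhiT_Phi)           # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        A = alpha * np.eye(k) + beta * PhiT_Phi        # posterior precision (Hessian)
        m = beta * np.linalg.solve(A, Phi.T @ t)       # posterior mean of the weights
        lam = beta * data_eigs
        gamma = np.sum(lam / (lam + alpha))            # effective number of well-determined parameters
        alpha = gamma / (m @ m)                        # re-estimate weight decay constant
        beta = (N - gamma) / np.sum((t - Phi @ m) ** 2)  # re-estimate noise precision
    # Log evidence, for comparing alternative models or regularisers (Occam's razor)
    A = alpha * np.eye(k) + beta * PhiT_Phi
    m = beta * np.linalg.solve(A, Phi.T @ t)
    misfit = 0.5 * beta * np.sum((t - Phi @ m) ** 2) + 0.5 * alpha * (m @ m)
    log_evidence = (0.5 * k * np.log(alpha) + 0.5 * N * np.log(beta) - misfit
                    - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
    return m, alpha, beta, gamma, log_evidence

# Example usage on a hypothetical polynomial interpolation problem
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
t = np.sin(3 * x) + 0.1 * rng.standard_normal(30)
Phi = np.vander(x, 6, increasing=True)                 # design matrix of monomial basis functions
m, alpha, beta, gamma, log_ev = evidence_framework(Phi, t)
```

On each pass, γ = Σᵢ λᵢ/(λᵢ + α) counts the well-determined parameters, and the returned log evidence can be compared across alternative basis sets or regularisers; this comparison is the quantitative form of Occam's razor referred to above.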
