Flat Minima

We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a flat minimum of the error function. A flat minimum is a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to simple networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require Gaussian assumptions and does not depend on a good weight prior. Instead, we have a prior over input-output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has the same order of complexity as backpropagation. It automatically prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and optimal brain surgeon/optimal brain damage.
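To make the notion of flatness concrete, the following minimal sketch compares two toy minima by the average rise in error under small random weight perturbations. It is an illustration only and not the flat minimum search algorithm itself: the quadratic error surface, the function names `quadratic_error` and `mean_error_increase`, the curvature values, and the perturbation radius are all assumptions chosen for the example.

```python
import random

def quadratic_error(weights, curvatures):
    """Toy error surface with its minimum at the origin (hypothetical stand-in
    for a network's training error)."""
    return 0.5 * sum(c * w * w for c, w in zip(curvatures, weights))

def mean_error_increase(error_fn, minimum, radius, samples=2000, seed=0):
    """Average rise in error when each weight is perturbed uniformly
    within [-radius, radius] around the given minimum."""
    rng = random.Random(seed)
    base = error_fn(minimum)
    total = 0.0
    for _ in range(samples):
        perturbed = [w + rng.uniform(-radius, radius) for w in minimum]
        total += error_fn(perturbed) - base
    return total / samples

# Two minima with the same error value (zero) but different curvature.
sharp = mean_error_increase(
    lambda w: quadratic_error(w, [50.0, 50.0]), [0.0, 0.0], radius=0.1)
flat = mean_error_increase(
    lambda w: quadratic_error(w, [0.5, 0.5]), [0.0, 0.0], radius=0.1)

# The flat minimum tolerates the same perturbations with a much smaller
# rise in error, i.e. the error stays approximately constant around it.
print(sharp > flat)
```

Under these assumptions, the low-curvature minimum shows the smaller average error increase, which is the sense in which the error "remains approximately constant" over a larger region of weight space.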
