Bayesian Regularization and Pruning Using a Laplace Prior

Standard techniques for improved generalization from neural networks include weight decay and pruning. Weight decay has a Bayesian interpretation, with the decay function corresponding to a prior over weights. The method of transformation groups and maximum entropy suggests a Laplace rather than a Gaussian prior. After training, the weights then arrange themselves into two classes: (1) those with a common sensitivity to the data error, and (2) those failing to achieve this sensitivity, which therefore vanish. Since the critical value is determined adaptively during training, pruning, in the sense of setting weights to exactly zero, becomes an automatic consequence of regularization alone. The number of free parameters is also reduced automatically as weights are pruned. A comparison is made with the results of MacKay using the evidence framework and a Gaussian regularizer.
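
To make the two weight classes concrete, the following is a minimal worked sketch of the optimality conditions implied by a Laplace prior. The notation is assumed here rather than taken from the abstract: E_D denotes the data error, alpha the regularization constant (the adaptively determined critical value), and w_i an individual weight.

```latex
% Sketch of the Laplace-prior objective and its stationarity conditions.
% E_D, \alpha and w_i are assumed notation, not taken from the abstract.

% Negative log-posterior with a Laplace (L1) prior over the weights:
M(\mathbf{w}) \;=\; E_D(\mathbf{w}) \;+\; \alpha \sum_i |w_i|

% At a minimum of M, each weight falls into one of the two classes:

% (1) surviving weights share a common sensitivity to the data error,
w_i \neq 0 \;\Longrightarrow\; \left| \frac{\partial E_D}{\partial w_i} \right| = \alpha ,

% (2) weights whose sensitivity fails to reach \alpha sit at exactly zero,
w_i = 0 \;\Longrightarrow\; \left| \frac{\partial E_D}{\partial w_i} \right| \le \alpha .
```

Because the Laplace penalty contributes a gradient of constant magnitude alpha everywhere except at zero, any weight whose data-error gradient cannot sustain that magnitude is held at exactly zero, which is why pruning requires no separate thresholding or post-processing step under this prior.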

[1]  W. Dearborn, Experiments in learning, 1910.

[2]  A. N. Tikhonov et al., Solutions of ill-posed problems, 1977.

[3]  Philip E. Gill et al., Practical optimization, 1981.

[4]  Geoffrey E. Hinton et al., Experiments on Learning by Back Propagation, 1986.

[5]  Roger Fletcher et al., Practical methods of optimization (2nd ed.), 1987.

[6]  Lawrence D. Jackel et al., Large Automatic Learning, Rule Extraction, and Generalization, Complex Systems, 1987.

[7]  R. Fletcher, Practical Methods of Optimization, 1988.

[8]  Yann LeCun et al., Optimal Brain Damage, NIPS, 1989.

[9]  M. Møller, A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning, 1990.

[10]  David E. Rumelhart et al., Generalization by Weight-Elimination with Application to Forecasting, NIPS, 1990.

[11]  Geoffrey E. Hinton et al., Adaptive Soft Weight Tying using Gaussian Mixtures, NIPS, 1991.

[12]  D. MacKay et al., A Practical Bayesian Framework for Backprop Networks, 1991.

[13]  Wray L. Buntine et al., Bayesian Back-Propagation, Complex Systems, 1991.

[14]  David H. Wolpert et al., On the Use of Evidence in Neural Networks, NIPS, 1992.

[15]  David J. C. MacKay et al., Bayesian Interpolation, Neural Computation, 1992.

[16]  C. M. Bishop et al., Curvature-Driven Smoothing in Backpropagation Neural Networks, 1992.

[17]  Babak Hassibi et al., Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, NIPS, 1992.

[18]  David J. C. MacKay et al., A Practical Bayesian Framework for Backpropagation Networks, Neural Computation, 1992.

[19]  Radford M. Neal, Bayesian Learning via Stochastic Dynamics, NIPS, 1992.

[20]  Radford M. Neal, Bayesian training of backpropagation networks by the hybrid Monte Carlo method, 1992.

[21]  Martin Fodslette Møller et al., A scaled conjugate gradient algorithm for fast supervised learning, Neural Networks, 1993.

[22]  M. F. Møller et al., Exact Calculation of the Product of the Hessian Matrix of Feed-Forward Network Error Functions and a Vector in O(N) Time, 1993.

[23]  Christopher M. Bishop et al., Curvature-driven smoothing: a learning algorithm for feedforward networks, IEEE Transactions on Neural Networks, 1993.

[24]  H. H. Thodberg, Ace of Bayes: Application of Neural Networks with Pruning, 1993.

[25]  P. M. Williams, Improved generalization and network pruning using adaptive Laplace regularization, 1993.

[26]  Barak A. Pearlmutter, Fast Exact Multiplication by the Hessian, Neural Computation, 1994.

[27]  D. Signorini et al., Neural networks, The Lancet, 1995.

[28]  P. M. Williams et al., Using Neural Networks to Model Conditional Multivariate Densities, Neural Computation, 1996.

[29]  Tor Arne Johansen et al., Identification of non-linear systems using empirical data and prior knowledge - an optimization approach, Automatica, 1996.