Paper Information - A Simple Weight Decay Can Improve Generalization

A Simple Weight Decay Can Improve Generalization

It has been observed in numerical simulations that a weight decay can improve generalization in a feed-forward neural network. This paper explains why. It is proven that a weight decay has two effects in a linear network. First, it suppresses any irrelevant components of the weight vector by choosing the smallest vector that solves the learning problem. Second, if the size is chosen right, a weight decay can suppress some of the effects of static noise on the targets, which improves generalization quite a lot. It is then shown how to extend these results to networks with hidden layers and non-linear units. Finally the theory is confirmed by some numerical simulations using the data from NetTalk.
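The first effect described in the abstract (weight decay selecting the smallest weight vector that solves the learning problem) can be illustrated with a minimal NumPy sketch. It assumes weight decay is modeled as a quadratic penalty `lam * ||w||^2` added to the squared error of a single-layer linear network, and it uses the closed-form minimizer rather than the gradient-descent dynamics analyzed in the paper. The names `ridge_solution`, `X`, `y`, `w_true`, and `lam` are illustrative and do not come from the paper.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Closed-form minimizer of ||X w - y||^2 + lam * ||w||^2.

    Assumption: weight decay is treated as a quadratic penalty on the
    weights of a linear network with inputs X and targets y.
    """
    n_weights = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_weights), X.T @ y)

rng = np.random.default_rng(0)

# Fewer examples (5) than weights (20): many weight vectors fit the data exactly.
X = rng.normal(size=(5, 20))
w_true = rng.normal(size=20)
y = X @ w_true  # noise-free targets

# A very small decay still fits the data but removes weight components
# that the training inputs cannot constrain.
w_decay = ridge_solution(X, y, lam=1e-6)

# Minimum-norm exact solution, computed via the pseudo-inverse for comparison.
w_min_norm = np.linalg.pinv(X) @ y

print("norm of w_true    :", np.linalg.norm(w_true))
print("norm of w_decay   :", np.linalg.norm(w_decay))
print("norm of w_min_norm:", np.linalg.norm(w_min_norm))
print("max |w_decay - w_min_norm|:", np.max(np.abs(w_decay - w_min_norm)))
```

In this sketch the decayed solution should coincide, up to numerical tolerance, with the minimum-norm solution and should have a smaller norm than `w_true`, because the components of `w_true` orthogonal to the training inputs are not needed to fit the data. The second effect in the abstract, suppression of target noise for a well-chosen decay strength, is not covered here.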

Anders Krogh | John A. Hertz

[1] Geoffrey E. Hinton. Learning Translation Invariant Recognition in Massively Parallel Networks, 1987, PARLE.

[2] Terrence J. Sejnowski et al. Parallel Networks that Learn to Pronounce English Text, 1987, Complex Syst.

[3] David Haussler et al. What Size Net Gives Valid Generalization?, 1989, Neural Computation.

[4] Naftali Tishby et al. Consistent inference of probabilities in layered networks: predictions and generalizations, 1989, International 1989 Joint Conference on Neural Networks.

[5] Yann LeCun et al. Optimal Brain Damage, 1989, NIPS.

[6] Vijay K. Samalam et al. Exhaustive Learning, 1990, Neural Computation.

[7] David E. Rumelhart et al. Generalization by Weight-Elimination with Application to Forecasting, 1990, NIPS.

[8] Anders Krogh et al. Introduction to the theory of neural computation, 1994, The advanced book program.

[9] Hans Henrik Thodberg et al. Improving Generalization of Neural Networks Through Pruning, 1991, Int. J. Neural Syst.

[10] D. Mackay et al. A Practical Bayesian Framework for Backprop Networks, 1991.

[11] J. Hertz et al. Generalization in a linear perceptron in the presence of noise, 1992.