Overfitting in Neural Nets: Backpropagation, Conjugate Gradient, and Early Stopping
[1] Terrence J. Sejnowski, et al. Parallel Networks that Learn to Pronounce English Text, 1987, Complex Systems.
[2] Dean Pomerleau, et al. ALVINN: An Autonomous Land Vehicle in a Neural Network, 1989, NIPS.
[3] David Haussler, et al. What Size Net Gives Valid Generalization?, 1989, Neural Computation.
[4] Yann LeCun, et al. Optimal Brain Damage, 1989, NIPS.
[5] John E. Moody, et al. Note on Learning Rate Schedules for Stochastic Optimization, 1990, NIPS.
[6] David E. Rumelhart, et al. Generalization by Weight-Elimination with Application to Forecasting, 1990, NIPS.
[7] Anders Krogh, et al. A Simple Weight Decay Can Improve Generalization, 1991, NIPS.
[8] James A. Pittman, et al. Recognizing Hand-Printed Letters and Digits Using Backpropagation Learning, 1991, Neural Computation.
[9] John E. Moody, et al. The Effective Number of Parameters: An Analysis of Generalization and Regularization in Nonlinear Learning Systems, 1991, NIPS.
[10] Elie Bienenstock, et al. Neural Networks and the Bias/Variance Dilemma, 1992, Neural Computation.
[11] Andreas Weigend, et al. On Overfitting and the Effective Number of Hidden Units, 1993.
[12] Peter L. Bartlett, et al. For Valid Generalization the Size of the Weights is More Important than the Size of the Network, 1996, NIPS.
[13] David H. Wolpert, et al. On Bias Plus Variance, 1997, Neural Computation.