Early Stopping - But When?

Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”). The exact criterion used for validation-based early stopping, however, is usually chosen in an ad-hoc fashion or training is stopped interactively. This trick describes how to select a stopping criterion in a systematic fashion; it is a trick for either speeding learning procedures or improving generalization, whichever is more important in the particular situation. An empirical investigation on multi-layer perceptrons shows that there exists a tradeoff between training time and generalization: From the given mix of 1296 training runs using different 12 problems and 24 different network architectures I conclude slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).

[1]  Scott E. Fahlman,et al.  An empirical study of learning speed in back-propagation networks , 1988 .

[2]  Hervé Bourlard,et al.  Generalization and Parameter Estimation in Feedforward Netws: Some Experiments , 1989, NIPS.

[3]  Christian Lebiere,et al.  The Cascade-Correlation Learning Architecture , 1989, NIPS.

[4]  Yann LeCun,et al.  Optimal Brain Damage , 1989, NIPS.

[5]  Michael I. Jordan,et al.  Advances in Neural Information Processing Systems 30 , 1995 .

[6]  David E. Rumelhart,et al.  Generalization by Weight-Elimination with Application to Forecasting , 1990, NIPS.

[7]  Anders Krogh,et al.  A Simple Weight Decay Can Improve Generalization , 1991, NIPS.

[8]  Pierre Baldi,et al.  Temporal Evolution of Generalization during Learning in Linear Networks , 1991, Neural Computation.

[9]  Elie Bienenstock,et al.  Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[10]  Babak Hassibi,et al.  Second Order Derivatives for Network Pruning: Optimal Brain Surgeon , 1992, NIPS.

[11]  Geoffrey E. Hinton,et al.  Simplifying Neural Networks by Soft Weight-Sharing , 1992, Neural Computation.

[12]  Russell Reed,et al.  Pruning algorithms-a survey , 1993, IEEE Trans. Neural Networks.

[13]  John E. Moody,et al.  Fast Pruning Using Principal Components , 1993, NIPS.

[14]  Ferdinand Hergert,et al.  Improving model selection by nonconvergent methods , 1993, Neural Networks.

[15]  Martin A. Riedmiller,et al.  A direct adaptive method for faster backpropagation learning: the RPROP algorithm , 1993, IEEE International Conference on Neural Networks.

[16]  E. Fiesler,et al.  Comparative Bibliography of Ontogenic Neural Networks , 1994 .

[17]  Lutz Prechelt,et al.  PROBEN 1 - a set of benchmarks and benchmarking rules for neural network training algorithms , 1994 .

[18]  Klaus-Robert Müller,et al.  Statistical Theory of Overtraining - Is Cross-Validation Asymptotically Effective? , 1995, NIPS.

[19]  J. Stephen Judd,et al.  Optimal stopping and effective machine complexity in learning , 1993, Proceedings of 1995 IEEE International Symposium on Information Theory.

[20]  Klaus-Robert Müller,et al.  Asymptotic statistical theory of overtraining and cross-validation , 1997, IEEE Trans. Neural Networks.