Asymptotic statistical theory of overtraining and cross-validation

A statistical theory of overtraining is proposed. The analysis treats general realizable stochastic neural networks trained with the Kullback-Leibler divergence in the asymptotic regime of a large number of training examples. It is shown that the asymptotic gain in generalization error from early stopping is small, even when the optimal stopping time is known. For cross-validated stopping, we determine the ratio in which the examples should be divided into training and cross-validation sets in order to obtain optimal performance. Although cross-validated early stopping is useless in the asymptotic region, it does decrease the generalization error in the nonasymptotic region. Our large-scale simulations, performed on a CM5, are in good agreement with the analytical findings.
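To make the setting concrete, below is a minimal, illustrative sketch (not the paper's code) of cross-validated early stopping on a realizable stochastic teacher-student task: a logistic model is trained by gradient descent on the cross-entropy loss, a held-out fraction of the examples is used for validation, and training stops once the validation loss stops improving. The split fraction r = 1/sqrt(2m), with m the number of modifiable parameters, is used here only as an assumed heuristic motivated by the asymptotic analysis; all variable names, the patience rule, and the chosen constants are illustrative.

```python
# Minimal sketch of cross-validated early stopping (illustrative only).
# A stochastic binary classifier is trained by gradient descent on the
# cross-entropy loss, i.e. the Kullback-Leibler divergence between teacher
# and student output distributions.  A fraction r of the examples is held
# out for validation; training stops when the validation loss has not
# improved for `patience` epochs.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic realizable teacher-student task.
m = 32                                   # number of parameters (input dimension)
n = 2000                                 # total number of examples
w_teacher = rng.normal(size=m)
X = rng.normal(size=(n, m))
p_teacher = 1.0 / (1.0 + np.exp(-X @ w_teacher))
y = rng.binomial(1, p_teacher)           # stochastic labels from the teacher

# Train / cross-validation split; r = 1/sqrt(2m) is an assumed heuristic.
r = 1.0 / np.sqrt(2 * m)
n_val = int(r * n)
X_val, y_val = X[:n_val], y[:n_val]
X_tr, y_tr = X[n_val:], y[n_val:]

def loss_and_grad(w, X, y):
    """Mean cross-entropy (KL) loss and its gradient for a logistic model."""
    q = 1.0 / (1.0 + np.exp(-X @ w))
    eps = 1e-12
    loss = -np.mean(y * np.log(q + eps) + (1 - y) * np.log(1 - q + eps))
    grad = X.T @ (q - y) / len(y)
    return loss, grad

w = np.zeros(m)
lr, patience = 0.5, 20
best_val, best_w, since_best = np.inf, w.copy(), 0

for epoch in range(5000):
    _, g = loss_and_grad(w, X_tr, y_tr)
    w -= lr * g
    val_loss, _ = loss_and_grad(w, X_val, y_val)
    if val_loss < best_val - 1e-6:
        best_val, best_w, since_best = val_loss, w.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:       # stop near the validation minimum
            break

print(f"stopped at epoch {epoch}, best validation loss {best_val:.4f}")
```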
