Statistical Theory of Learning Curves under Entropic Loss Criterion

The present paper elucidates a universal property of learning curves, which shows how the generalization error, the training error, and the complexity of the underlying stochastic machine are related, and how the behavior of a stochastic machine improves as the number of training examples increases. The error is measured by the entropic loss. It is proved that the generalization error converges to H0, the entropy of the conditional distribution of the true machine, as H0 + m*/(2t), while the training error converges as H0 - m*/(2t), where t is the number of examples and m* measures the complexity of the network. When the model is faithful, implying that the true machine is contained in the model, m* reduces to m, the number of modifiable parameters. This is a universal law because it holds for any regular machine, irrespective of its structure, under the maximum likelihood estimator. Similar relations are obtained for the Bayes and Gibbs learning algorithms. These learning curves show the relation among the accuracy of learning, the complexity of a model, and the number of training examples.
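The stated law is easy to check numerically in the simplest faithful setting. The following sketch (not from the paper; all names such as mu0 and entropic_loss are illustrative) uses a Gaussian location family with known variance, so the model is faithful with m = 1 modifiable parameter. It averages the training and generalization entropic losses of the maximum likelihood estimator over many training sets and compares them with H0 - m/(2t) and H0 + m/(2t).

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative faithful model: the true machine is N(mu0, sigma^2) with
    # sigma known, so there is m = 1 modifiable parameter (the mean).
    mu0, sigma = 0.0, 1.0
    H0 = 0.5 * np.log(2.0 * np.pi * sigma**2) + 0.5  # entropy of the true machine

    def entropic_loss(x, mu):
        """Negative log-likelihood of samples x under N(mu, sigma^2)."""
        return 0.5 * np.log(2.0 * np.pi * sigma**2) + (x - mu) ** 2 / (2.0 * sigma**2)

    for t in (10, 100, 1000):
        train, gen = [], []
        for _ in range(20_000):  # average over many independent training sets
            x = rng.normal(mu0, sigma, size=t)
            mu_hat = x.mean()  # the maximum likelihood estimator
            train.append(entropic_loss(x, mu_hat).mean())
            # For this model the generalization (predictive) entropic loss
            # has the closed form H0 + (mu_hat - mu0)^2 / (2 sigma^2).
            gen.append(H0 + (mu_hat - mu0) ** 2 / (2.0 * sigma**2))
        print(f"t={t:4d}  train={np.mean(train):.5f} (theory {H0 - 1/(2*t):.5f})  "
              f"gen={np.mean(gen):.5f} (theory {H0 + 1/(2*t):.5f})")

At t = 10 the simulated curves straddle H0 (approximately 1.41894 here) by about 0.05 on each side, matching m/(2t) = 0.05; the symmetric gap shrinks as 1/t, as the universal law predicts.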
