Singularities Affect Dynamics of Learning in Neuromanifolds

The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in gaussian mixture probability densities, ARMA time-series models, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and, in particular, the dynamics of learning from examples. Prevailing theories have so far paid little attention to the problems caused by singularities, relying instead on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers begun to examine the effects of singularities, and the theory is still under development. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We present our recent results on these problems, using simple toy models to obtain explicit solutions. We explain that, because the Fisher information matrix degenerates, the maximum likelihood estimator is no longer asymptotically gaussian; that model selection criteria such as AIC, BIC, and MDL fail to hold in these models; that a smooth Bayesian prior becomes singular in such models; and that the trajectories of the dynamics of learning are strongly affected by the singularities, causing plateaus or slow manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.
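As a minimal numerical illustration of the degeneracy described above (our own toy example, not taken from the article), the sketch below estimates the Fisher information matrix of a two-component gaussian mixture p(x; w, mu) = (1 - w) N(x; 0, 1) + w N(x; mu, 1) by Monte Carlo, and shows its smallest eigenvalue collapsing as mu -> 0, where the two components coincide and the mixing weight w becomes unidentifiable. The helper names `phi` and `fisher_info` are ours.

```python
# A minimal sketch (hypothetical toy model): the Fisher information matrix
# of a two-component gaussian mixture
#     p(x; w, mu) = (1 - w) * N(x; 0, 1) + w * N(x; mu, 1)
# degenerates as mu -> 0, where the two components coincide.
import numpy as np

rng = np.random.default_rng(0)

def phi(x):
    """Standard gaussian density."""
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def fisher_info(w, mu, n=200_000):
    """Monte Carlo estimate of the 2x2 Fisher information matrix
    for parameters (w, mu), using samples drawn from the model itself."""
    # Sample x ~ p(x; w, mu): pick a component, then add gaussian noise.
    z = rng.random(n) < w
    x = rng.standard_normal(n) + np.where(z, mu, 0.0)
    p = (1.0 - w) * phi(x) + w * phi(x - mu)
    # Score function: gradient of log p(x; w, mu) with respect to (w, mu).
    dp_dw = phi(x - mu) - phi(x)
    dp_dmu = w * (x - mu) * phi(x - mu)
    s = np.stack([dp_dw / p, dp_dmu / p])  # shape (2, n)
    # Fisher information = expected outer product of the score.
    return s @ s.T / n

for mu in [1.0, 0.3, 0.1, 0.0]:
    G = fisher_info(w=0.5, mu=mu)
    print(f"mu = {mu:4.1f}  eigenvalues of G: {np.linalg.eigvalsh(G)}")
```

At mu = 0 the score component with respect to w vanishes identically, so G is exactly singular; its inverse, which appears both in the Cramér-Rao bound and in the natural gradient update, does not exist there. This is the mechanism behind the plateaus in learning dynamics and the motivation for geometry-aware methods such as adaptive natural gradient learning.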
