Statistical mechanical analysis of learning dynamics of two-layer perceptron with multiple output units

The plateau phenomenon, in which the loss stops decreasing for an extended period during learning, is a persistent problem in training neural networks. Various studies suggest that it is frequently caused by the network being trapped in a singular region of the loss surface, a region that stems from the symmetric structure of neural networks. However, these studies all deal with networks having one-dimensional output; networks with multidimensional output have been overlooked. This paper uses a statistical mechanical formulation to analyze the learning dynamics of a two-layer perceptron with multidimensional output. We derive order parameters that capture the macroscopic characteristics of the connection weights, together with the differential equations they obey. In a simple setting, we show that plateaus driven by the singular region diminish or vanish when the output is multidimensional. We find that the more non-degenerate the model is (i.e., the farther it is from an effectively one-dimensional output), the more the plateaus are alleviated. Furthermore, we show theoretically that singular-region-driven plateaus seldom occur during learning when the initialization is orthogonalized.
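The analysis described in the abstract concerns online learning in a teacher-student setting, where the theory tracks overlap-type order parameters of the connection weights in the large-input-dimension limit. Below is a minimal, illustrative sketch (not the authors' code) of the corresponding simulation: online gradient descent for a two-layer teacher-student network with multiple output units and Gaussian inputs. The dimensions N, K, M, the learning rate eta, and the erf activation are assumptions chosen for illustration, and the generalization error is estimated by Monte Carlo rather than by integrating the derived differential equations.

```python
# Minimal sketch (assumed setup, not the paper's code): online gradient descent
# for a teacher-student two-layer network with M output units and Gaussian inputs.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)

N, K, M = 100, 2, 3         # input dim, hidden units (student = teacher), output dim
eta, steps = 0.1, 200_000   # learning rate and number of online examples (illustrative)

g = lambda u: erf(u / np.sqrt(2.0))   # hidden-unit activation

# Teacher weights (fixed) and student weights (trained).
B = rng.standard_normal((K, N)) / np.sqrt(N)          # teacher hidden weights
V = rng.standard_normal((M, K))                       # teacher output weights
J = 0.5 * rng.standard_normal((K, N)) / np.sqrt(N)    # student hidden weights
W = 0.5 * rng.standard_normal((M, K))                 # student output weights

def forward(Jh, Wo, x):
    h = g(Jh @ x)           # hidden activations
    return Wo @ h, h        # output vector (length M), hidden vector (length K)

def gen_error(n_test=2000):
    """Monte Carlo estimate of the generalization (squared) error."""
    err = 0.0
    for x in rng.standard_normal((n_test, N)):
        y_s, _ = forward(J, W, x)
        y_t, _ = forward(B, V, x)
        err += 0.5 * np.sum((y_s - y_t) ** 2)
    return err / n_test

for t in range(steps):
    x = rng.standard_normal(N)                 # one online example
    y_s, h_s = forward(J, W, x)
    y_t, _ = forward(B, V, x)
    delta = y_s - y_t                          # output error (length M)

    # Gradients of 0.5 * ||delta||^2 with respect to W and J.
    grad_W = np.outer(delta, h_s)
    dg = np.sqrt(2.0 / np.pi) * np.exp(-0.5 * (J @ x) ** 2)   # g'(J x)
    grad_J = np.outer((W.T @ delta) * dg, x)

    # Learning rates scaled by 1/N so the dynamics unfold on an O(N) time scale.
    W -= eta / N * grad_W
    J -= eta / N * grad_J

    if t % 20_000 == 0:
        print(f"t = {t:7d}   generalization error = {gen_error():.5f}")
```

With a nearly symmetric initialization of the student hidden weights (rows of J close to one another), a run of this kind typically shows an extended flat stretch in the generalization error before the hidden units differentiate; the paper's theoretical treatment replaces such simulations with deterministic equations for the order parameters.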
