Natural Gradient Works Efficiently in Learning

When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. Information geometry is used to calculate the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation), and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient on-line learning is analyzed and proved to be Fisher efficient, implying that it asymptotically has the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon, which appears in the backpropagation learning algorithm of multilayer perceptrons, might disappear or become less serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.
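
As a rough illustration of the idea, the sketch below applies a natural gradient on-line update to a toy problem that is not taken from the paper: estimating the mean and standard deviation of a one-dimensional Gaussian. The natural gradient is taken to be the ordinary gradient premultiplied by the inverse of the Fisher information matrix. The function names, the Gaussian model, and the 1/t learning-rate schedule are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_fisher(sigma):
    """Fisher information matrix of N(mu, sigma^2) in coordinates (mu, sigma)."""
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])

def nll_grad(x, mu, sigma):
    """Ordinary gradient of -log p(x | mu, sigma) with respect to (mu, sigma)."""
    d = x - mu
    return np.array([-d / sigma**2, 1.0 / sigma - d**2 / sigma**3])

def natural_gradient(grad, fisher):
    """Return G^{-1} grad by solving G v = grad (no explicit inverse)."""
    return np.linalg.solve(fisher, grad)

def online_natural_gradient(samples, mu0=0.0, sigma0=1.0):
    """One pass over the samples with learning rate 1/t (an assumed schedule)."""
    mu, sigma = mu0, sigma0
    for t, x in enumerate(samples, start=1):
        step = natural_gradient(nll_grad(x, mu, sigma), gaussian_fisher(sigma))
        mu, sigma = np.array([mu, sigma]) - step / t
    return mu, sigma

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=0.5, size=5000)
    print(online_natural_gradient(data))
```

In this toy setting the update for the mean reduces to the running sample mean, which is the batch maximum-likelihood estimate, so the on-line rule matches the batch estimator for that coordinate. This is only a small-scale analogue of the Fisher efficiency claim, not a reproduction of the paper's analysis.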
