Natural Gradient Works Efficiently in Learning

When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation), and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon, which appears in the backpropagation learning algorithm of multilayer perceptrons, might disappear or might not be so serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.
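As a rough illustration of the idea, the following sketch applies a natural gradient online update to a toy problem that is not taken from the abstract: estimating the mean and standard deviation of a one-dimensional Gaussian. The ordinary gradient of the negative log-likelihood is premultiplied by the inverse Fisher information matrix, which plays the role of the Riemannian metric on the parameter space. The function names, the choice of model, and the 1/t learning-rate schedule are assumptions made for this example only.

```python
import numpy as np


def loss_gradient(theta, x):
    """Ordinary gradient of -log p(x; mu, sigma) for a 1-D Gaussian."""
    mu, sigma = theta
    return np.array([-(x - mu) / sigma**2,
                     1.0 / sigma - (x - mu) ** 2 / sigma**3])


def fisher_matrix(theta):
    """Fisher information of a 1-D Gaussian in (mu, sigma) coordinates."""
    _, sigma = theta
    return np.diag([1.0 / sigma**2, 2.0 / sigma**2])


def natural_gradient_step(theta, x, lr):
    """One online update: theta <- theta - lr * G(theta)^{-1} grad."""
    nat_grad = np.linalg.solve(fisher_matrix(theta), loss_gradient(theta, x))
    return theta - lr * nat_grad


def fit_online(samples, theta0=(0.0, 1.0)):
    """Process samples one at a time with an assumed 1/t learning rate."""
    theta = np.array(theta0, dtype=float)
    for t, x in enumerate(samples, start=1):
        theta = natural_gradient_step(theta, x, lr=1.0 / t)
    return theta
```

For this model the inverse Fisher matrix cancels the dependence of the step on sigma in the mean coordinate, so with the 1/t schedule the mean estimate reduces to the running sample average, which is the batch maximum likelihood estimate. This is only a small instance of the asymptotic equivalence with batch estimation that the abstract describes.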
