Revisiting Natural Gradient for Deep Networks

We evaluate natural gradient, an algorithm originally proposed in Amari (1997), for learning deep models. The contributions of this paper are as follows. We show the connection between natural gradient and three other recently proposed methods for training deep models: Hessian-Free Optimization (Martens, 2010), Krylov Subspace Descent (Vinyals and Povey, 2012) and TONGA (Le Roux et al., 2008). We describe how one can use unlabeled data to improve the generalization error obtained by natural gradient, and we empirically evaluate the robustness of the algorithm to the ordering of the training set, compared to stochastic gradient descent. Finally, we extend natural gradient to incorporate second-order information alongside the manifold information, and we benchmark the new algorithm using a truncated Newton approach to invert the metric matrix instead of a diagonal approximation of it.
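To make the last point concrete, a natural gradient step amounts to solving the linear system F d = g, where F is the metric (Fisher) matrix and g is the gradient of the loss, and then updating the parameters along d. The sketch below illustrates how such a step can be computed with a truncated Newton (linear conjugate gradient) solver rather than a diagonal approximation of F. It is a minimal illustration only: the softmax-regression model, random data, damping, step size and iteration budget are assumptions made for the example, not the paper's setup, which trains deep networks and relies on matrix-free metric-vector products.

```python
# Minimal sketch (not the paper's code) of one natural gradient step:
# solve F d = g by truncated Newton (linear conjugate gradient), where F
# is the Fisher metric of a small softmax-regression model and g is the
# gradient of the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3                       # examples, input dim, classes (illustrative)
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
W = np.zeros((d, k))                       # parameters of p(y|x) = softmax(x W)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(W):
    """Average negative log-likelihood of the labels."""
    p = softmax(X @ W)
    return -np.log(p[np.arange(n), y]).mean()

def loss_grad(W):
    """Gradient of the average negative log-likelihood."""
    p = softmax(X @ W)
    t = np.zeros_like(p)
    t[np.arange(n), y] = 1.0
    return (X.T @ (p - t)) / n

def fisher_vec(W, v, damping=1e-3):
    """Fisher-vector product F v, with the expectation taken over the
    model's own conditional p(y|x) plus Tikhonov damping."""
    p = softmax(X @ W)
    V = v.reshape(d, k)
    Fv = np.zeros_like(V)
    for i in range(n):                     # explicit loops: clarity over speed
        for c in range(k):
            g = -p[i].copy()
            g[c] += 1.0                    # d log p(c|x_i) / d logits
            G = np.outer(X[i], g)          # per-example, per-class gradient
            Fv += p[i, c] * G * np.sum(G * V)
    return (Fv / n + damping * V).ravel()

def cg(matvec, b, iters=20, tol=1e-10):
    """Truncated linear conjugate gradient for solving matvec(x) = b."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

g = loss_grad(W).ravel()
step = cg(lambda v: fisher_vec(W, v), g)   # natural direction, approx. F^{-1} g
W_new = W - 0.5 * step.reshape(d, k)       # one damped natural gradient update
print("NLL before: %.4f  after: %.4f" % (nll(W), nll(W_new)))
```

Because the solver only needs Fisher-vector products, never the matrix F itself, the same conjugate gradient scheme remains applicable to models for which forming or inverting F explicitly would be infeasible.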

[1] Shun-ichi Amari et al. Differential-geometrical methods in statistics, 1985.

[2] F. Götze. Differential-geometrical methods in statistics. Lecture Notes in Statistics, by Shun-ichi Amari, 1987.

[3] Shun-ichi Amari et al. Information geometry of Boltzmann machines. IEEE Trans. Neural Networks, 1992.

[4] J. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994.

[5] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 1994.

[6] Shun-ichi Amari et al. Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient. NIPS, 1996.

[7] Magnus Rattray et al. Natural gradient descent for on-line learning, 1998.

[8] Shun-ichi Amari et al. Natural Gradient Works Efficiently in Learning. Neural Computation, 1998.

[9] Kenji Fukumizu et al. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 2000.

[10] Kenji Fukumizu et al. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 2000.

[11] Tom Heskes et al. On Natural Learning and Pruning in Multilayered Perceptrons. Neural Computation, 2000.

[12] Shun-ichi Amari et al. Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons. NIPS, 2001.

[13] Sham M. Kakade et al. A Natural Policy Gradient. NIPS, 2001.

[14] D. K. Smith et al. Numerical Optimization. J. Oper. Res. Soc., 2001.

[15] Nicol N. Schraudolph. Fast Curvature Matrix-Vector Products. ICANN, 2001.

[16] Nicol N. Schraudolph et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 2002.

[17] Stefan Schaal et al. Natural Actor-Critic. Neurocomputing, 2003.

[18] Christopher M. Bishop et al. Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

[19] Nicolas Le Roux et al. Topmoumoute Online Natural Gradient Algorithm. NIPS, 2007.

[20] Juha Karhunen et al. Natural Conjugate Gradient in Variational Inference. ICONIP, 2007.

[21] José R. Dorronsoro et al. Natural conjugate gradient training of multilayer perceptrons. Neurocomputing, 2006.

[22] Tom Schaul et al. Stochastic search using the natural gradient. ICML, 2009.

[23] Levent Tunçel et al. Optimization algorithms on matrix manifolds. Math. Comput., 2009.

[24] Yoshua Bengio et al. Why Does Unsupervised Pre-training Help Deep Learning? AISTATS, 2010.

[25] Nicolas Le Roux et al. Improving First and Second-Order Methods by Modeling Uncertainty, 2010.

[26] James Martens et al. Deep learning via Hessian-free optimization. ICML, 2010.

[27] Juha Karhunen et al. Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes. J. Mach. Learn. Res., 2010.

[28] Andrew W. Fitzgibbon et al. A fast natural Newton method. ICML, 2010.

[29] Ilya Sutskever et al. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML, 2011.

[30] Razvan Pascanu et al. Deep Learners Benefit More from Out-of-Distribution Examples. AISTATS, 2011.

[31] O. Chapelle. Improved Preconditioner for Hessian Free Optimization, 2011.

[32] Geoffrey E. Hinton et al. Generating Text with Recurrent Neural Networks. ICML, 2011.

[33] Michael A. Saunders et al. MINRES-QLP: A Krylov Subspace Method for Indefinite or Singular Symmetric Systems. SIAM J. Sci. Comput., 2010.

[34] Tom Schaul et al. Natural evolution strategies converge on sphere functions. GECCO, 2012.

[35] Jascha Sohl-Dickstein et al. The Natural Gradient by Analogy to Signal Whitening, and Recipes and Tricks for its Use. ArXiv, 2012.

[36] Razvan Pascanu et al. Theano: new features and speed improvements. ArXiv, 2012.

[37] Daniel Povey et al. Krylov Subspace Descent for Deep Learning. AISTATS, 2011.

[38] Pascal Vincent et al. Disentangling Factors of Variation for Facial Expression Recognition. ECCV, 2012.

[39] Razvan Pascanu et al. Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines. ICLR, 2013.

[40] Ryan Kiros et al. Training Neural Networks with Stochastic Hessian-Free Optimization. ICLR, 2013.