Revisiting Natural Gradient for Deep Networks

We evaluate natural gradient, an algorithm originally proposed in Amari (1997), for learning deep models. The contributions of this paper are as follows. We show the connection between natural gradient and three other recently proposed methods for training deep models: Hessian-Free Optimization (Martens, 2010), Krylov Subspace Descent (Vinyals and Povey, 2012) and TONGA (Le Roux et al., 2008). We describe how one can use unlabeled data to improve the generalization error obtained by natural gradient, and we empirically evaluate the robustness of the algorithm to the ordering of the training set, compared to stochastic gradient descent. Finally, we extend natural gradient to incorporate second-order information alongside the manifold information, and we benchmark the new algorithm using a truncated Newton approach to invert the metric matrix instead of a diagonal approximation of it.
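To make the last point concrete, a natural gradient step amounts to solving the linear system F d = g, where F is the metric (Fisher) matrix and g is the gradient of the loss, and then updating the parameters along d. The sketch below illustrates how such a step can be computed with a truncated Newton (linear conjugate gradient) solver rather than a diagonal approximation of F. It is a minimal illustration only: the softmax-regression model, random data, damping, step size and iteration budget are assumptions made for the example, not the paper's setup, which trains deep networks and relies on matrix-free metric-vector products.

```python
# Minimal sketch (not the paper's code) of one natural gradient step:
# solve F d = g by truncated Newton (linear conjugate gradient), where F
# is the Fisher metric of a small softmax-regression model and g is the
# gradient of the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 10, 3                       # examples, input dim, classes (illustrative)
X = rng.normal(size=(n, d))
y = rng.integers(0, k, size=n)
W = np.zeros((d, k))                       # parameters of p(y|x) = softmax(x W)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(W):
    """Average negative log-likelihood of the labels."""
    p = softmax(X @ W)
    return -np.log(p[np.arange(n), y]).mean()

def loss_grad(W):
    """Gradient of the average negative log-likelihood."""
    p = softmax(X @ W)
    t = np.zeros_like(p)
    t[np.arange(n), y] = 1.0
    return (X.T @ (p - t)) / n

def fisher_vec(W, v, damping=1e-3):
    """Fisher-vector product F v, with the expectation taken over the
    model's own conditional p(y|x) plus Tikhonov damping."""
    p = softmax(X @ W)
    V = v.reshape(d, k)
    Fv = np.zeros_like(V)
    for i in range(n):                     # explicit loops: clarity over speed
        for c in range(k):
            g = -p[i].copy()
            g[c] += 1.0                    # d log p(c|x_i) / d logits
            G = np.outer(X[i], g)          # per-example, per-class gradient
            Fv += p[i, c] * G * np.sum(G * V)
    return (Fv / n + damping * V).ravel()

def cg(matvec, b, iters=20, tol=1e-10):
    """Truncated linear conjugate gradient for solving matvec(x) = b."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

g = loss_grad(W).ravel()
step = cg(lambda v: fisher_vec(W, v), g)   # natural direction, approx. F^{-1} g
W_new = W - 0.5 * step.reshape(d, k)       # one damped natural gradient update
print("NLL before: %.4f  after: %.4f" % (nll(W), nll(W_new)))
```

Because the solver only needs Fisher-vector products, never the matrix F itself, the same conjugate gradient scheme remains applicable to models for which forming or inverting F explicitly would be infeasible.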

[1] Shun-ichi Amari et al. Differential-geometrical methods in statistics, 1985.

[2] F. Götze. Differential-geometrical methods in statistics. Lecture Notes in Statistics, by Shun-ichi Amari, 1987.

[3] Shun-ichi Amari et al. Information geometry of Boltzmann machines. IEEE Trans. Neural Networks, 1992.

[4] J. Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994.

[5] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 1994.

[6] Shun-ichi Amari et al. Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient. NIPS, 1996.

[7] Magnus Rattray et al. Natural gradient descent for on-line learning, 1998.

[8] Shun-ichi Amari et al. Natural Gradient Works Efficiently in Learning. Neural Computation, 1998.

[9] Kenji Fukumizu et al. Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 2000.

[10] Kenji Fukumizu et al. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 2000.

[11] Tom Heskes et al. On Natural Learning and Pruning in Multilayered Perceptrons. Neural Computation, 2000.

[12] Shun-ichi Amari et al. Geometrical Singularities in the Neuromanifold of Multilayer Perceptrons. NIPS, 2001.

[13] Sham M. Kakade et al. A Natural Policy Gradient. NIPS, 2001.

[14] D. K. Smith et al. Numerical Optimization. J. Oper. Res. Soc., 2001.

[15] Nicol N. Schraudolph. Fast Curvature Matrix-Vector Products. ICANN, 2001.

[16] Nicol N. Schraudolph et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent. Neural Computation, 2002.

[17] Stefan Schaal et al. Natural Actor-Critic. Neurocomputing, 2003.

[18] Christopher M. Bishop et al. Pattern Recognition and Machine Learning (Information Science and Statistics), 2006.

[19] Nicolas Le Roux et al. Topmoumoute Online Natural Gradient Algorithm. NIPS, 2007.

[20] Juha Karhunen et al. Natural Conjugate Gradient in Variational Inference. ICONIP, 2007.

[21] José R. Dorronsoro et al. Natural conjugate gradient training of multilayer perceptrons. Neurocomputing, 2006.

[22] Tom Schaul et al. Stochastic search using the natural gradient. ICML, 2009.

[23] Levent Tunçel et al. Optimization algorithms on matrix manifolds. Math. Comput., 2009.

[24] Yoshua Bengio et al. Why Does Unsupervised Pre-training Help Deep Learning? AISTATS, 2010.

[25] Nicolas Le Roux et al. Improving First and Second-Order Methods by Modeling Uncertainty, 2010.

[26] James Martens et al. Deep learning via Hessian-free optimization. ICML, 2010.

[27] Juha Karhunen et al. Approximate Riemannian Conjugate Gradient Learning for Fixed-Form Variational Bayes. J. Mach. Learn. Res., 2010.

[28] Andrew W. Fitzgibbon et al. A fast natural Newton method. ICML, 2010.

[29] Ilya Sutskever et al. Learning Recurrent Neural Networks with Hessian-Free Optimization. ICML, 2011.

[30] Razvan Pascanu et al. Deep Learners Benefit More from Out-of-Distribution Examples. AISTATS, 2011.

[31] O. Chapelle. Improved Preconditioner for Hessian Free Optimization, 2011.

[32] Geoffrey E. Hinton et al. Generating Text with Recurrent Neural Networks. ICML, 2011.

[33] Michael A. Saunders et al. MINRES-QLP: A Krylov Subspace Method for Indefinite or Singular Symmetric Systems. SIAM J. Sci. Comput., 2010.

[34] Tom Schaul et al. Natural evolution strategies converge on sphere functions. GECCO, 2012.

[35] Jascha Sohl-Dickstein et al. The Natural Gradient by Analogy to Signal Whitening, and Recipes and Tricks for its Use. ArXiv, 2012.

[36] Razvan Pascanu et al. Theano: new features and speed improvements. ArXiv, 2012.

[37] Daniel Povey et al. Krylov Subspace Descent for Deep Learning. AISTATS, 2011.

[38] Pascal Vincent et al. Disentangling Factors of Variation for Facial Expression Recognition. ECCV, 2012.

[39] Razvan Pascanu et al. Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines. ICLR, 2013.

[40] Ryan Kiros et al. Training Neural Networks with Stochastic Hessian-Free Optimization. ICLR, 2013.