Connection of Diagonal Hessian Estimates to Natural Gradients in Stochastic Optimization

With the resurgence of artificial intelligence, statistical learning theory and information science, the core technologies behind AI, are receiving growing attention. Handling massive data requires efficient learning algorithms. In deep learning, natural gradient algorithms such as AdaGrad and Adam are widely used, motivated by Newton's method, which rescales gradients using second-order derivatives. By approximating the second-order geometry of the empirical loss with the empirical Fisher information matrix (FIM), natural gradient methods are expected to gain extra learning efficiency. However, the exact curvature of the empirical loss is described by the Hessian matrix, not the FIM, and the bias between the empirical FIM and the Hessian persists before convergence, which undermines the expected efficiency. In this paper, we present a new stochastic optimization algorithm for deep learning, diagSG (diagonal Hessian stochastic gradient). As a second-order algorithm, diagSG estimates the diagonal entries of the Hessian matrix at each iteration through simultaneous perturbation stochastic approximation (SPSA) and uses these entries to set the adaptive learning rate. By comparing the rescaling matrices of diagSG and of natural gradient methods, we argue that diagSG better characterizes the loss curvature because it approximates the Hessian diagonal more accurately. We provide an experiment that supports this argument.
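
To make the mechanism concrete, the sketch below is a minimal NumPy illustration under our own assumptions, not the paper's implementation: the function names, the running-average smoothing, and the toy quadratic loss are ours. It shows how a simultaneous-perturbation estimate of the Hessian diagonal can be formed from two gradient evaluations and then used as the rescaling matrix, in contrast with the squared-gradient (empirical FIM diagonal) accumulator of Adam-style methods.

```python
import numpy as np


def spsa_diag_hessian(grad_fn, theta, c=1e-2, rng=None):
    """One simultaneous-perturbation estimate of the Hessian diagonal at theta.

    grad_fn(theta) returns a (possibly noisy) gradient of the loss;
    c is the perturbation half-width. Names here are illustrative.
    """
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.choice([-1.0, 1.0], size=theta.shape)      # Rademacher perturbation
    g_plus = grad_fn(theta + c * delta)
    g_minus = grad_fn(theta - c * delta)
    # Coordinate-wise finite difference of the gradient along the perturbation:
    # the diagonal terms survive, while off-diagonal terms have zero mean over delta.
    return (g_plus - g_minus) / (2.0 * c * delta)


def preconditioned_step(grad_fn, theta, h_bar, k, lr=0.1, c=1e-2, eps=1e-8, rng=None):
    """One illustrative update using the Hessian-diagonal estimate as the
    rescaling matrix (the running average below is an assumption, not the
    paper's exact recursion)."""
    h_k = spsa_diag_hessian(grad_fn, theta, c=c, rng=rng)
    h_bar = (k * h_bar + np.abs(h_k)) / (k + 1)            # average the diagonal estimates
    theta = theta - lr * grad_fn(theta) / (h_bar + eps)    # curvature-rescaled gradient step
    # An Adam/AdaGrad-style method would instead accumulate squared gradients
    # (the empirical FIM diagonal) and divide by its square root.
    return theta, h_bar


# Toy usage on a quadratic loss f(theta) = 0.5 * theta @ A @ theta,
# whose Hessian diagonal is exactly diag(A).
A = np.diag([1.0, 10.0, 100.0])
grad_fn = lambda th: A @ th
theta, h_bar = np.ones(3), np.zeros(3)
for k in range(200):
    theta, h_bar = preconditioned_step(grad_fn, theta, h_bar, k)
```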
