Efficient Implementation of Second-Order Stochastic Approximation Algorithms in High-Dimensional Problems

Stochastic approximation (SA) algorithms have been widely applied in minimization problems when the loss functions and/or the gradient information are only accessible through noisy evaluations. Stochastic gradient (SG) descent, a first-order algorithm and a workhorse of much machine learning, is perhaps the most famous form of SA. Among all SA algorithms, the second-order simultaneous perturbation stochastic approximation (2SPSA) and the second-order stochastic gradient (2SG) are particularly efficient in handling high-dimensional problems, covering both gradient-free and gradient-based scenarios. However, due to the necessary matrix operations, the per-iteration floating-point-operations (FLOPs) cost of the standard 2SPSA/2SG is $O(p^{3})$, where $p$ is the dimension of the underlying parameter. Note that the $O(p^{3})$ FLOPs cost is distinct from the classical SPSA-based per-iteration $O(1)$ cost in terms of the number of noisy function evaluations. In this work, we propose a technique to efficiently implement the 2SPSA/2SG algorithms via the symmetric indefinite matrix factorization and show that the FLOPs cost is reduced from $O(p^{3})$ to $O(p^{2})$. The formal almost sure convergence and rate of convergence for the newly proposed approach are directly inherited from the standard 2SPSA/2SG. The improvement in efficiency and numerical stability is demonstrated in two numerical studies.
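To make the distinction between function-evaluation cost and FLOPs cost concrete, the following is a minimal sketch of the classical first-order simultaneous perturbation gradient estimate, which uses two noisy loss evaluations per iteration regardless of $p$. It is not the paper's 2SPSA/2SG implementation and does not include the symmetric indefinite factorization. The names `spsa_gradient` and `CountingLoss`, the quadratic test loss, and the perturbation size `c` are illustrative assumptions.

```python
import random


def spsa_gradient(loss, theta, c, rng):
    """Simultaneous perturbation gradient estimate (illustrative sketch)."""
    # One random sign (+1 or -1) per coordinate: a Rademacher perturbation.
    delta = [rng.choice((-1.0, 1.0)) for _ in theta]
    theta_plus = [t + c * d for t, d in zip(theta, delta)]
    theta_minus = [t - c * d for t, d in zip(theta, delta)]
    # Exactly two loss evaluations, independent of the dimension p.
    diff = loss(theta_plus) - loss(theta_minus)
    # Every coordinate reuses the same difference, scaled by its own sign.
    return [diff / (2.0 * c * d) for d in delta]


class CountingLoss:
    """Hypothetical quadratic loss that counts how often it is evaluated."""

    def __init__(self):
        self.calls = 0

    def __call__(self, theta):
        self.calls += 1
        return sum(t * t for t in theta)


if __name__ == "__main__":
    for p in (5, 500):
        loss = CountingLoss()
        spsa_gradient(loss, [1.0] * p, 0.1, random.Random(0))
        print(p, loss.calls)
```

The evaluation count stays fixed as `p` grows, while the arithmetic in this first-order estimate grows only linearly in `p`. The $O(p^{3})$ cost discussed in the abstract comes from the matrix operations of the second-order methods, which this sketch leaves out.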

Jingyi Zhu | James C. Spall | Long Wang
