Efficient Implementation of Second-Order Stochastic Approximation Algorithms in High-Dimensional Problems

Stochastic approximation (SA) algorithms are widely applied to minimization problems in which the loss function and/or the gradient information is accessible only through noisy evaluations. Stochastic gradient (SG) descent, a first-order algorithm and a workhorse of machine learning, is perhaps the best-known form of SA. Among SA algorithms, the second-order simultaneous perturbation stochastic approximation (2SPSA) and the second-order stochastic gradient (2SG) are particularly efficient in handling high-dimensional problems, covering both gradient-free and gradient-based scenarios. However, because of the matrix operations they require, the per-iteration floating-point-operations (FLOPs) cost of the standard 2SPSA/2SG is $O(p^{3})$, where $p$ is the dimension of the underlying parameter. Note that this $O(p^{3})$ FLOPs cost is distinct from the classical SPSA-based per-iteration $O(1)$ cost, which is measured in the number of noisy function evaluations. In this work, we propose a technique to efficiently implement the 2SPSA/2SG algorithms via the symmetric indefinite matrix factorization and show that the FLOPs cost is reduced from $O(p^{3})$ to $O(p^{2})$. The formal almost sure convergence and rate of convergence for the newly proposed approach are directly inherited from the standard 2SPSA/2SG. The improvement in efficiency and numerical stability is demonstrated in two numerical studies.
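To make the distinction between the two cost measures concrete, the following is a minimal sketch of the classical first-order SPSA gradient estimate, not of the factorization-based method proposed here. It illustrates why the evaluation count is $O(1)$ in $p$: one random perturbation vector is drawn and the loss is queried exactly twice, whatever the dimension. The names `spsa_gradient`, `noisy_quadratic`, and the perturbation size `c` are hypothetical choices for illustration, and the Rademacher (±1) perturbation is an assumed, commonly used distribution.

```python
import random


def spsa_gradient(loss, theta, c, rng):
    """Simultaneous perturbation gradient estimate (illustrative sketch).

    loss  : callable returning a (possibly noisy) scalar loss for a list of floats
    theta : current parameter estimate, a list of p floats
    c     : positive perturbation size (assumed tuning parameter)
    rng   : random.Random instance used to draw the perturbation
    """
    p = len(theta)
    # Draw one Rademacher (+1/-1) perturbation for all p components at once.
    delta = [rng.choice((-1.0, 1.0)) for _ in range(p)]
    theta_plus = [t + c * d for t, d in zip(theta, delta)]
    theta_minus = [t - c * d for t, d in zip(theta, delta)]
    # Exactly two loss evaluations, independent of the dimension p.
    diff = loss(theta_plus) - loss(theta_minus)
    # Every component reuses the same scalar difference.
    return [diff / (2.0 * c * d) for d in delta]


def noisy_quadratic(theta, rng, noise_sd=0.01):
    """Hypothetical test loss: sum of squares plus Gaussian measurement noise."""
    return sum(t * t for t in theta) + rng.gauss(0.0, noise_sd)


rng = random.Random(0)
theta0 = [1.0] * 1000  # p = 1000, still only two loss evaluations
g_hat = spsa_gradient(lambda th: noisy_quadratic(th, rng), theta0, 0.1, rng)
```

Forming the estimate costs only $O(p)$ arithmetic here. The $O(p^{3})$ FLOPs cost discussed above arises in the second-order variants, which additionally maintain and apply a $p \times p$ Hessian estimate at each iteration.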
