Stochastic version of second-order (Newton-Raphson) optimization using only function measurements

Consider the problem of loss-function minimization when only (possibly noisy) measurements of the loss function are available. In particular, no measurements of the gradient of the loss function are assumed available (as required in the steepest-descent or Newton-Raphson algorithms). Stochastic approximation (SA) algorithms of the multivariate Kiefer-Wolfowitz (finite-difference) form have long been considered for such problems, but with only limited success. The simultaneous perturbation SA (SPSA) algorithm has addressed one of the major shortcomings of those finite-difference SA algorithms by significantly reducing the number of measurements required in many multivariate problems of practical interest. SPSA displays the classic behavior of first-order search algorithms: typically a steep initial decline in the loss function followed by a slow decline toward the optimum. This paper presents a second-order SPSA algorithm based on estimating both the loss-function gradient and the inverse Hessian matrix at each iteration. The aim is to emulate the acceleration properties of deterministic algorithms of Newton-Raphson form, particularly in the terminal phase, where the first-order SPSA algorithm's convergence slows. The second-order SPSA algorithm requires only three loss-function measurements at each iteration, independent of the problem dimension. The paper includes a formal convergence result for this second-order approach.
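As an illustration of the three-measurement idea described above, the Python sketch below combines a standard two-sided simultaneous-perturbation gradient estimate with one extra measurement that yields a rank-one, symmetrized Hessian estimate; the estimates are averaged across iterations and used in a Newton-like update. This is a minimal sketch, not the paper's exact estimator: the quadratic-decay gain sequences, the choice of equal perturbation sizes for both directions, and the eigenvalue-clipping step that keeps the averaged Hessian positive definite are all illustrative assumptions.

```python
import numpy as np

def second_order_spsa(loss, theta0, n_iter=2000, a=0.1, A=100.0,
                      alpha=0.602, c=0.1, gamma=0.101, seed=0):
    """Sketch of a second-order SPSA iteration using three loss
    measurements per step (illustrative gain constants)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    p = theta.size
    H_bar = np.eye(p)                       # running average of Hessian estimates
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha      # step-size sequence
        c_k = c / (k + 1) ** gamma          # perturbation-size sequence
        delta = rng.choice([-1.0, 1.0], size=p)    # Bernoulli +/-1 perturbation
        delta_t = rng.choice([-1.0, 1.0], size=p)  # independent second perturbation
        y_plus = loss(theta + c_k * delta)                   # measurement 1
        y_minus = loss(theta - c_k * delta)                  # measurement 2
        y_both = loss(theta + c_k * (delta + delta_t))       # measurement 3
        # Two-sided simultaneous-perturbation gradient estimate.
        g_hat = (y_plus - y_minus) / (2.0 * c_k * delta)
        # Rank-one Hessian estimate from the third measurement,
        # symmetrized so its expectation matches the true Hessian.
        M = (y_both - y_plus) / c_k**2 * np.outer(1.0 / delta_t, 1.0 / delta)
        H_hat = 0.5 * (M + M.T)
        # Average the per-iteration Hessian estimates to damp the noise.
        H_bar = (k / (k + 1.0)) * H_bar + H_hat / (k + 1.0)
        # Force positive definiteness before inverting (one simple choice).
        w, V = np.linalg.eigh(H_bar)
        H_pd = (V * np.maximum(w, 1e-4)) @ V.T
        # Newton-like step: precondition the gradient by the inverse Hessian.
        theta = theta - a_k * np.linalg.solve(H_pd, g_hat)
    return theta
```

A quick sanity check on a noisy quadratic, e.g. `second_order_spsa(lambda t: t @ t + 0.01 * np.random.randn(), np.ones(5))`, should drive the iterate toward the origin; in practice the gain constants a, A, and c need tuning per problem.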
