A Stochastic Quasi-Newton Method for Large-Scale Optimization

The question of how to incorporate curvature information into stochastic approximation methods is challenging. Directly applying classical quasi-Newton updating techniques from deterministic optimization leads to noisy curvature estimates that harm the robustness of the iteration. In this paper, we propose a stochastic quasi-Newton method that is efficient, robust, and scalable. It employs the classical BFGS update formula in its limited-memory form and is based on the observation that it is beneficial to collect curvature information pointwise, at regular intervals, through (sub-sampled) Hessian-vector products. This technique differs from the classical approach, which computes differences of gradients and in which controlling the quality of the curvature estimates can be difficult. We present numerical results on problems arising in machine learning which suggest that the proposed method shows much promise.

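The abstract sketches the central algorithmic idea: decouple the stochastic gradient steps from the curvature updates, and form correction pairs from sub-sampled Hessian-vector products rather than from differences of stochastic gradients. Below is a minimal Python sketch of that idea for binary logistic regression; the averaging interval L, memory size m, batch sizes, the beta/k step size, and all helper names are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a stochastic limited-memory BFGS iteration whose curvature
# pairs come from sub-sampled Hessian-vector products (binary logistic loss).
# All hyperparameters and helper names are illustrative assumptions.
import numpy as np

def stochastic_grad(w, X, y, idx):
    """Mini-batch gradient of the logistic loss on rows `idx`."""
    Xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    return Xb.T @ (p - yb) / len(idx)

def subsampled_hessian_vec(w, X, idx, v):
    """Sub-sampled Hessian-vector product for the logistic loss."""
    Xb = X[idx]
    p = 1.0 / (1.0 + np.exp(-Xb @ w))
    d = p * (1.0 - p)                      # diagonal of the GLM Hessian
    return Xb.T @ (d * (Xb @ v)) / len(idx)

def two_loop(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion applied to a stochastic gradient."""
    q = grad.copy()
    alphas = []
    for s, yv in zip(reversed(s_list), reversed(y_list)):
        a = (s @ q) / (yv @ s)
        alphas.append(a)
        q -= a * yv
    s, yv = s_list[-1], y_list[-1]         # scale by the most recent pair
    q *= (s @ yv) / (yv @ yv)
    for (s, yv), a in zip(zip(s_list, y_list), reversed(alphas)):
        b = (yv @ q) / (yv @ s)
        q += (a - b) * s
    return q

def sqn(X, y, steps=2000, b=50, bH=300, L=10, m=10, beta=1.0, seed=0):
    """Illustrative stochastic quasi-Newton loop (not the paper's code)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    s_list, y_list = [], []
    w_bar, w_bar_prev = np.zeros(d), None
    for k in range(1, steps + 1):
        idx = rng.choice(n, b, replace=False)
        g = stochastic_grad(w, X, y, idx)
        # Plain SGD until at least one curvature pair is available.
        direction = g if not s_list else two_loop(g, s_list, y_list)
        w -= (beta / k) * direction
        w_bar += w / L                     # running average of the last L iterates
        if k % L == 0:                     # update curvature every L steps
            if w_bar_prev is not None:
                s = w_bar - w_bar_prev
                idxH = rng.choice(n, bH, replace=False)
                yv = subsampled_hessian_vec(w_bar, X, idxH, s)
                if s @ yv > 1e-10:         # safeguard: keep only positive-curvature pairs
                    s_list.append(s); y_list.append(yv)
                    if len(s_list) > m:
                        s_list.pop(0); y_list.pop(0)
            w_bar_prev, w_bar = w_bar, np.zeros(d)
    return w
```

Calling `sqn(X, y)` on a feature matrix `X` and 0/1 labels `y` would return weights trained with this scheme. The point of the sketch is the separation of concerns the paper describes: cheap, small-batch gradients drive the iteration at every step, while the more expensive, larger-batch Hessian-vector products are computed only every L iterations and at averaged points, which is what keeps the curvature estimates stable under sampling noise.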