A Linearly-Convergent Stochastic L-BFGS Algorithm

We propose a new stochastic L-BFGS algorithm and prove a linear convergence rate for strongly convex and smooth functions. Our algorithm draws heavily from a recent stochastic variant of L-BFGS proposed by Byrd et al. (2014) as well as a recent approach to variance reduction for stochastic gradient descent due to Johnson and Zhang (2013). We demonstrate experimentally that our algorithm performs well on large-scale convex and non-convex optimization problems, exhibiting linear convergence and rapidly solving these problems to high precision. Furthermore, we show that our algorithm performs well for a wide range of step sizes, often differing by several orders of magnitude.
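To make the two ingredients concrete, below is a minimal NumPy sketch of one way to combine SVRG-style variance reduction (Johnson and Zhang, 2013) with L-BFGS curvature built from subsampled Hessian-vector products (in the spirit of Byrd et al., 2014), specialized here to least squares so a Hessian-vector product is a single pair of matrix products. The function names, default hyperparameters, and the iterate-averaging window are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def two_loop_recursion(g, s_list, y_list):
    """Apply the L-BFGS inverse-Hessian approximation to g using the
    stored curvature pairs (s, y) via the standard two-loop recursion."""
    q = g.copy()
    rhos = [1.0 / y.dot(s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * s.dot(q)
        alphas.append(alpha)
        q -= alpha * y
    if s_list:  # usual initial scaling H0 = (s'y / y'y) I
        s, y = s_list[-1], y_list[-1]
        q *= s.dot(y) / y.dot(y)
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * y.dot(q)
        q += (alpha - beta) * s
    return q

def svrg_lbfgs(X, t, eta=0.05, n_epochs=20, batch=10, hess_batch=100,
               pair_every=10, memory=10, seed=0):
    """Sketch of variance-reduced stochastic L-BFGS for least squares,
    f(w) = (1/2n) ||Xw - t||^2. All hyperparameters are illustrative."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    s_list, y_list = [], []
    u_prev, u_accum = None, np.zeros(d)
    for _ in range(n_epochs):
        w_snap = w.copy()
        mu = X.T @ (X @ w_snap - t) / n  # full gradient at the snapshot
        for k in range(1, n // batch + 1):
            S = rng.choice(n, batch, replace=False)
            # SVRG variance-reduced gradient estimate.
            v = (X[S].T @ (X[S] @ w - t[S])
                 - X[S].T @ (X[S] @ w_snap - t[S])) / batch + mu
            w -= eta * two_loop_recursion(v, s_list, y_list)
            u_accum += w
            if k % pair_every == 0:  # periodically refresh curvature pairs
                u = u_accum / pair_every
                u_accum = np.zeros(d)
                if u_prev is not None:
                    s_vec = u - u_prev
                    # y = (subsampled Hessian) @ s, a Hessian-vector
                    # product on a fresh subsample, following the idea
                    # of Byrd et al. (2014).
                    T = rng.choice(n, hess_batch, replace=False)
                    y_vec = X[T].T @ (X[T] @ s_vec) / hess_batch
                    s_list.append(s_vec)
                    y_list.append(y_vec)
                    if len(s_list) > memory:
                        s_list.pop(0)
                        y_list.pop(0)
                u_prev = u
    return w
```

On a synthetic strongly convex least-squares instance, plotting f(w) - f(w*) per epoch for this sketch should show the geometric decrease the abstract describes as linear convergence, and eta can typically be varied over a fairly wide range without breaking convergence.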

[1] J. J. Moré et al. Quasi-Newton Methods, Motivation and Theory, 1974.

[2] R. Dembo et al. Inexact Newton Methods, 1982.

[3] Trond Steihaug et al. Truncated-Newton algorithms for large-scale unconstrained optimization. Math. Program., 1983.

[4] Jorge Nocedal et al. On the limited memory BFGS method for large scale optimization. Math. Program., 1989.

[5] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 1994.

[6] Stephen J. Wright et al. Numerical Optimization. Springer, 2006.

[7] Yann LeCun et al. Large Scale Online Learning. NIPS, 2003.

[8] Yiming Yang et al. RCV1: A New Benchmark Collection for Text Categorization Research. J. Mach. Learn. Res., 2004.

[9] H. Robbins. A Stochastic Approximation Method, 1951.

[10] Simon Günter et al. A Stochastic Quasi-Newton Method for Online Convex Optimization. AISTATS, 2007.

[11] Yurii Nesterov et al. Primal-dual subgradient methods for convex problems. Math. Program., 2005.

[12] Patrick Gallinari et al. SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. J. Mach. Learn. Res., 2009.

[13] Yoram Singer et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res., 2011.

[14] Léon Bottou et al. Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT, 2010.

[15] Thierry Bertin-Mahieux et al. The Million Song Dataset. ISMIR, 2011.

[16] Mark W. Schmidt et al. A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets. NIPS, 2012.

[17] Shai Shalev-Shwartz et al. Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 2012.

[18] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS, 2013.

[19] Xi Chen et al. Variance Reduction for Stochastic Gradient Optimization. NIPS, 2013.

[20] Geoffrey E. Hinton et al. On the importance of initialization and momentum in deep learning. ICML, 2013.

[21] Christopher Ré et al. Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput., 2013.

[22] John Langford et al. A reliable effective terascale linear learning system. J. Mach. Learn. Res., 2011.

[23] Francis Bach et al. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. NIPS, 2014.

[24] Surya Ganguli et al. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. ICML, 2013.

[25] Aryan Mokhtari et al. RES: Regularized Stochastic BFGS Algorithm. IEEE Transactions on Signal Processing, 2014.

[26] Aryan Mokhtari et al. Global convergence of online limited memory BFGS. J. Mach. Learn. Res., 2014.

[27] Thomas Hofmann et al. A Variance Reduced Stochastic Newton Method. arXiv, 2015.

[28] Sham M. Kakade et al. Competing with the Empirical Risk Minimizer in a Single Pass. COLT, 2014.

[29] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[30] Jorge Nocedal et al. A Stochastic Quasi-Newton Method for Large-Scale Optimization. SIAM J. Optim., 2014.

[31] Shiqian Ma et al. Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization. SIAM J. Optim., 2014.