Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods

We present an algorithm for minimizing a sum of functions that combines the computational efficiency of stochastic gradient descent (SGD) with the second-order curvature information leveraged by quasi-Newton methods. We unify these disparate approaches by maintaining an independent Hessian approximation for each contributing function in the sum. We maintain computational tractability and limit memory requirements, even for high-dimensional optimization problems, by storing and manipulating these quadratic approximations in a shared, time-evolving, low-dimensional subspace. This algorithm contrasts with earlier stochastic second-order techniques that treat the Hessian of each contributing function as a noisy approximation to the full Hessian, rather than as a target for direct estimation. Each update step requires only a single contributing function or minibatch evaluation (as in SGD), and each step is scaled using an approximate inverse Hessian with little to no hyperparameter adjustment required (as is typical for quasi-Newton methods). We experimentally demonstrate improved convergence on seven diverse optimization problems. The algorithm is released as open source Python and MATLAB packages.
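
The following is a minimal sketch of the core idea only, not the released SFO packages: keep a separate quadratic model for each contributing function, refresh one model per iteration from a single subfunction evaluation, and step to the minimum of the summed quadratics. The shared low-dimensional subspace machinery described above is omitted for clarity, and the function and variable names (sfo_like_minimize, f_grad_list) are hypothetical.

import numpy as np

def sfo_like_minimize(f_grad_list, x0, n_iter=100, init_hess_scale=1.0):
    """Minimize sum_i f_i(x) using one quadratic model per subfunction.

    f_grad_list : list of callables, each returning (f_i(x), grad_i(x)).
    x0          : initial parameter vector (1-D numpy array).
    """
    n, d = len(f_grad_list), x0.size
    x = x0.copy()
    # Per-subfunction state: last evaluation point, last gradient, Hessian model.
    last_x = np.tile(x, (n, 1))
    last_g = np.zeros((n, d))
    H = np.array([init_hess_scale * np.eye(d) for _ in range(n)])

    # Initialize every model with one gradient evaluation at the starting point.
    for i, fg in enumerate(f_grad_list):
        _, last_g[i] = fg(x)

    for _ in range(n_iter):
        i = np.random.randint(n)            # single subfunction per step, as in SGD
        _, g = f_grad_list[i](x)

        # BFGS-style secant update of this subfunction's Hessian model.
        s, y = x - last_x[i], g - last_g[i]
        if s @ y > 1e-10:                   # curvature condition keeps H[i] positive definite
            Hs = H[i] @ s
            H[i] += np.outer(y, y) / (y @ s) - np.outer(Hs, Hs) / (s @ Hs)
        last_x[i], last_g[i] = x.copy(), g.copy()

        # Minimize the sum of the quadratic models:
        #   x_new = x - (sum_i H_i)^{-1} * sum_i [grad_i + H_i (x - x_i)]
        H_total = H.sum(axis=0)
        g_total = sum(last_g[j] + H[j] @ (x - last_x[j]) for j in range(n))
        x = x - np.linalg.solve(H_total, g_total)

    return x

For a least-squares objective split into minibatches, for example, each element of f_grad_list would return one minibatch's loss and gradient. The released implementation additionally stores these per-subfunction models in a shared, time-evolving, low-dimensional subspace, which is what keeps memory and computation tractable for high-dimensional problems.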
