FLAG: Fast Linearly-Coupled Adaptive Gradient Method

Nesterov's celebrated accelerated gradient method offers significant speed-ups over classical gradient descent, attaining the optimal first-order oracle complexity for smooth convex optimization. On the other hand, the popular AdaGrad algorithm competes with mirror descent under the best regularizer by adaptively scaling the gradient. Recently, it has been shown that accelerated gradient descent can be viewed as a linear coupling of gradient descent and mirror descent steps. Here, we draw upon these ideas and present a fast linearly-coupled adaptive gradient method (FLAG) as an accelerated version of AdaGrad, and show that our algorithm can indeed offer the best of both worlds. Like Nesterov's accelerated algorithm and its proximal variant, FISTA, our method has a convergence rate of $1/T^2$ after $T$ iterations. Like AdaGrad, our method adaptively chooses a regularizer, in a way that performs almost as well as the best choice of regularizer in hindsight.
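To make the linear-coupling idea concrete, the sketch below shows one way an accelerated scheme can interleave a gradient-descent step with a mirror-descent step whose metric is rescaled AdaGrad-style by accumulated squared gradients. This is a minimal illustration under assumed choices, not the authors' exact FLAG pseudocode: the coupling weight $\tau_t$, the step sizes, the function name `flag_like_sketch`, and the least-squares test problem are all illustrative assumptions.

```python
# Illustrative sketch only: linearly-coupled gradient/mirror steps with an
# AdaGrad-style adaptive diagonal metric. Step sizes and coupling weights are
# assumptions for demonstration, not the paper's prescribed FLAG updates.
import numpy as np

def flag_like_sketch(grad, L, x0, T=500, eps=1e-8):
    """Couple a gradient-descent sequence and a mirror-descent sequence."""
    y = x0.copy()
    z = x0.copy()
    G = np.zeros_like(x0)                 # running sum of squared gradients (AdaGrad-style)
    for t in range(1, T + 1):
        tau = 2.0 / (t + 1)               # coupling weight, as in linear-coupling schemes
        x = tau * z + (1.0 - tau) * y     # linearly couple the two sequences
        g = grad(x)
        y = x - g / L                     # gradient-descent step (L = smoothness constant)
        G += g * g
        H = np.sqrt(G) + eps              # adaptive diagonal scaling of the mirror step
        z = z - (t / (2.0 * L)) * g / H   # mirror-descent step in the adaptive norm
    return y

# Toy usage on a least-squares objective f(x) = 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
x_hat = flag_like_sketch(grad, L, np.zeros(20))
print(0.5 * np.linalg.norm(A @ x_hat - b) ** 2)
```

The design point being illustrated is that the gradient step exploits smoothness while the mirror step, taken in an adaptively chosen norm, plays the role of AdaGrad's per-coordinate scaling; coupling the two sequences is what yields the accelerated behavior described above.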
