Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and yields regret guarantees that are provably as good as those of the best proximal function that could have been chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. Experiments corroborate our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
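
To make the adaptation concrete, the sketch below implements the diagonal form of this idea: each coordinate's effective step size is divided by the square root of that coordinate's accumulated squared subgradients, so frequently active features are damped while rare but predictive ones retain comparatively large steps. This is only an illustrative sketch; the function and parameter names (adagrad_step, eta, delta) are chosen here for exposition and are not notation from the paper.

    import numpy as np

    def adagrad_step(x, grad, sum_sq_grads, eta=0.1, delta=1e-8):
        """One diagonal adaptive subgradient step.

        Each coordinate's effective learning rate shrinks with the
        accumulated squared (sub)gradients observed so far for that
        coordinate; delta guards against division by zero before a
        coordinate has received any gradient signal.
        """
        sum_sq_grads += grad ** 2                        # per-coordinate history
        x -= eta * grad / (delta + np.sqrt(sum_sq_grads))
        return x, sum_sq_grads

    # Minimal usage: stochastic least-squares on synthetic data.
    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -1.0, 0.5])
    x = np.zeros(3)
    h = np.zeros(3)                                      # accumulated squared gradients
    for _ in range(1000):
        a = rng.normal(size=3)
        y = a @ w_true + 0.01 * rng.normal()
        grad = (x @ a - y) * a                           # gradient of 0.5 * (a.x - y)^2
        x, h = adagrad_step(x, grad, h, eta=0.5)

The single scalar eta plays only a minor role here: because the denominator adapts per coordinate, tuning it is far less delicate than tuning the decaying step size of a non-adaptive subgradient method.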
