Pegasos: primal estimated sub-gradient solver for SVM

We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $\tilde{O}(1/\epsilon)$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $\Omega(1/\epsilon^2)$ iterations. As in previously devised SVM solvers, the number of iterations also scales linearly with $1/\lambda$, where $\lambda$ is the regularization parameter of SVM. For a linear kernel, the total run-time of our method is $\tilde{O}(d/(\lambda\epsilon))$, where $d$ is a bound on the number of non-zero features in each example. Since the run-time does not depend directly on the size of the training set, the resulting algorithm is especially suited for learning from large datasets. Our approach also extends to non-linear kernels while working solely on the primal objective function, though in this case the run-time does depend linearly on the training set size. Our algorithm is particularly well suited for large text classification problems, where we demonstrate an order-of-magnitude speedup over previous SVM learning methods.
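
To make the single-example iteration concrete, here is a minimal sketch of a Pegasos-style update for the linear-kernel case. The abstract does not spell out the update rule, so the details are assumptions: the regularized hinge-loss objective (λ/2)·||w||² + average hinge loss, the step size 1/(λt), and the absence of a projection step. The names `pegasos_train`, `predict`, `lam`, `num_iters`, and `seed` are illustrative and not taken from the paper.

```python
import random


def dot(w, x):
    """Inner product of a dict-based weight vector with a sparse example."""
    return sum(w.get(j, 0.0) * v for j, v in x.items())


def pegasos_train(examples, lam=0.1, num_iters=2000, seed=0):
    """Stochastic sub-gradient descent on the primal SVM objective (sketch).

    examples: list of (x, y) pairs, where x is a sparse feature dict
              {feature_index: value} and y is +1 or -1.
    lam:      regularization parameter (assumed name for lambda).
    """
    rng = random.Random(seed)
    w = {}
    for t in range(1, num_iters + 1):
        # Each iteration looks at one randomly chosen training example.
        x, y = rng.choice(examples)
        eta = 1.0 / (lam * t)  # assumed step-size schedule 1/(lambda * t)
        margin = y * dot(w, x)

        # Sub-gradient step on the regularizer: shrink w by (1 - eta * lam).
        # This naive loop touches every stored weight; an implementation that
        # aims for O(d) work per step would keep a separate scalar factor.
        scale = 1.0 - eta * lam
        for j in w:
            w[j] *= scale

        # Sub-gradient step on the hinge loss, only if the margin is violated.
        if margin < 1.0:
            for j, v in x.items():
                w[j] = w.get(j, 0.0) + eta * y * v
    return w


def predict(w, x):
    """Sign of the linear score, returned as +1 or -1."""
    return 1 if dot(w, x) >= 0.0 else -1


if __name__ == "__main__":
    # Tiny hand-made, linearly separable data set (illustrative only).
    data = [
        ({0: 1.0, 1: 1.0}, 1),
        ({0: -1.0, 1: -1.0}, -1),
        ({0: 2.0}, 1),
        ({1: -2.0}, -1),
    ]
    w = pegasos_train(data, lam=0.1, num_iters=2000, seed=0)
    print([predict(w, x) for x, _ in data])
```

The loop never scans the full training set, which is the sense in which the iteration count does not depend directly on the number of examples; the cost of one step is driven by the non-zero features of the sampled example plus, in this naive version, the shrink over stored weights.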
