The Tradeoffs of Large Scale Learning

This contribution develops a theoretical framework that accounts for the effect of approximate optimization on learning algorithms. The analysis shows distinct tradeoffs for small-scale and for large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation tradeoff. Large-scale learning problems are subject to a qualitatively different tradeoff in which the computational complexity of the underlying optimization algorithm plays a non-trivial role.
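To make the two regimes concrete, the tradeoff can be sketched as an excess-error decomposition (a minimal sketch reconstructed from the abstract's claim; the notation below is illustrative rather than quoted from the paper). Let $E(f)$ denote the expected risk, $f^{*}$ its minimizer over all functions, $f^{*}_{\mathcal{F}}$ its minimizer within the chosen family $\mathcal{F}$, $f_{n}$ the empirical risk minimizer on $n$ examples, and $\tilde{f}_{n}$ the approximate solution returned when the optimizer is stopped at tolerance $\rho$, so that $E_{n}(\tilde{f}_{n}) \le E_{n}(f_{n}) + \rho$. The excess error then splits into three terms:

\[
\mathcal{E} \;=\;
\underbrace{\mathbb{E}\!\left[E(f^{*}_{\mathcal{F}}) - E(f^{*})\right]}_{\mathcal{E}_{\mathrm{app}}}
\;+\;
\underbrace{\mathbb{E}\!\left[E(f_{n}) - E(f^{*}_{\mathcal{F}})\right]}_{\mathcal{E}_{\mathrm{est}}}
\;+\;
\underbrace{\mathbb{E}\!\left[E(\tilde{f}_{n}) - E(f_{n})\right]}_{\mathcal{E}_{\mathrm{opt}}}
\]

Under this reading, small-scale problems are constrained by the number of examples $n$: the tolerance $\rho$ can be driven to zero, $\mathcal{E}_{\mathrm{opt}}$ vanishes, and only the approximation term $\mathcal{E}_{\mathrm{app}}$ and the estimation term $\mathcal{E}_{\mathrm{est}}$ trade off against each other. Large-scale problems are instead constrained by computing time: the achievable $\rho$ depends on the per-iteration cost and convergence rate of the chosen optimization algorithm, so $\mathcal{E}_{\mathrm{opt}}$ enters the tradeoff alongside the other two terms.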
