On the generalization ability of on-line learning algorithms

In this paper, it is shown how to extract a hypothesis with small risk from the ensemble of hypotheses generated by an arbitrary on-line learning algorithm run on an independent and identically distributed (i.i.d.) sample of data. Using a simple large deviation argument, we prove tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic M_n associated with the on-line performance of the ensemble. Via sharp pointwise bounds on M_n, we then obtain risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix. These bounds reveal that the linear hypotheses found via our approach achieve optimal tradeoffs between hinge loss and margin size over the class of all linear functions, an issue that was left open by previous results. A distinctive feature of our approach is that the key tools for our analysis come from the model of prediction of individual sequences; i.e., a model making no probabilistic assumptions on the source generating the data. In fact, these tools turn out to be so powerful that we only need very elementary statistical facts to obtain our final risk bounds.
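To make the online-to-batch idea concrete, here is a minimal sketch in Python (not the paper's code): a plain perceptron is run once over the i.i.d. sample, the ensemble of hypotheses it generates is stored together with the per-round losses whose average is the statistic M_n, and a single hypothesis is then extracted by minimizing its empirical risk on the examples it has not yet seen, penalized by a confidence term. The choice of the perceptron, the 0-1 loss, the exact form of the penalty, and all function names are assumptions made for this sketch only.

import numpy as np

def perceptron_ensemble(X, y):
    # Run the classical perceptron once over the sample, storing the
    # hypothesis used at each round and its 0-1 loss on that round's example.
    n, d = X.shape
    w = np.zeros(d)
    hypotheses, losses = [], []
    for t in range(n):
        hypotheses.append(w.copy())            # h_{t-1}: built from examples 0..t-1
        mistake = y[t] * (X[t] @ w) <= 0       # 0-1 loss of h_{t-1} on (x_t, y_t)
        losses.append(float(mistake))
        if mistake:                            # standard perceptron update on a mistake
            w = w + y[t] * X[t]
    return hypotheses, np.array(losses)

def extract_hypothesis(hypotheses, losses, X, y, delta=0.05):
    # M_n: average on-line loss of the ensemble (the statistic in the abstract).
    n = len(losses)
    M_n = losses.mean()
    best_score, best_w = np.inf, hypotheses[-1]
    for t, w in enumerate(hypotheses):
        m = n - t                              # number of examples h_{t-1} has not seen
        suffix_risk = np.mean(y[t:] * (X[t:] @ w) <= 0)
        penalty = np.sqrt(np.log(n * (n + 1) / delta) / (2.0 * m))
        if suffix_risk + penalty < best_score: # penalized empirical-risk selection
            best_score, best_w = suffix_risk + penalty, w
    return best_w, M_n

# Example usage (labels y in {-1, +1}):
#   hyps, losses = perceptron_ensemble(X, y)
#   w_hat, M_n = extract_hypothesis(hyps, losses, X, y)

For convex losses, simply averaging the hypotheses in the ensemble is an alternative extraction rule; in either case the risk of the extracted hypothesis is controlled by M_n plus a lower-order confidence term, which is the form of bound described above.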
