Adaptive and Self-Confident On-Line Learning Algorithms

Most of the performance bounds for on-line learning algorithms are proven assuming a constant learning rate. To optimize these bounds, the learning rate must be tuned based on quantities that are generally unknown, as they depend on the whole sequence of examples. In this paper we show that essentially the same optimized bounds can be obtained when the algorithms adaptively tune their learning rates as the examples in the sequence are progressively revealed. Our adaptive learning rates apply to a wide class of on-line algorithms, including p-norm algorithms for generalized linear regression and Weighted Majority for linear regression with absolute loss. We emphasize that our adaptive tunings are radically different from previous techniques, such as the so-called doubling trick. Whereas the doubling trick restarts the on-line algorithm several times using a constant learning rate for each run, our methods save information by changing the value of the learning rate very smoothly. In fact, for Weighted Majority over a finite set of experts our analysis provides a better leading constant than the doubling trick.
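To make the contrast with the doubling trick concrete, below is a minimal, hypothetical sketch (in Python) of a Hedge/Weighted Majority-style forecaster over N experts whose learning rate is retuned smoothly at every trial from the smallest cumulative expert loss seen so far, instead of being held constant and reset by periodic restarts. The particular rate sqrt(ln N / (1 + L*)) and the name adaptive_hedge are illustrative assumptions for this sketch, not the paper's exact tuning or constants.

```python
import numpy as np

def adaptive_hedge(expert_losses):
    """Exponentially weighted forecaster with a smoothly adapting learning rate.

    expert_losses: array of shape (T, N) with per-trial expert losses in [0, 1].
    Returns the sequence of the forecaster's expected losses.

    Illustrative sketch only: eta_t = sqrt(ln N / (1 + L*_{t-1})), where
    L*_{t-1} is the smallest cumulative expert loss so far, follows the spirit
    of a self-confident tuning; the paper's exact rate and constants differ.
    """
    T, N = expert_losses.shape
    cum_loss = np.zeros(N)            # cumulative loss of each expert
    forecaster_losses = []
    for t in range(T):
        # Learning rate shrinks smoothly as the best expert's loss grows.
        eta = np.sqrt(np.log(N) / (1.0 + cum_loss.min()))
        # Weights are recomputed from the full cumulative losses (no restarts);
        # subtracting the minimum keeps the exponentials numerically stable.
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        p = w / w.sum()
        forecaster_losses.append(float(p @ expert_losses[t]))
        cum_loss += expert_losses[t]
    return forecaster_losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    losses = rng.random((1000, 10))   # synthetic per-trial expert losses
    total = sum(adaptive_hedge(losses))
    print(f"forecaster loss: {total:.1f}, best expert: {losses.sum(axis=0).min():.1f}")
```

Because the weights are always recomputed from the full cumulative losses, changing the learning rate between trials discards no information, which is the point of contrast with restart-based tunings drawn in the abstract above.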
