Tracking the best regressor

In most of the on-line learning research the total on-line loss of the algorithm is compared to the total loss of the best off-line predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done where the comparison vector ut at each trial t is allowed to change with time, and the total online loss of the algorithm is compared to the sum of the losses of ut at each trial plus the total “cost” for shifting to successive comparison vectors. This is to model situations in which the examples change over time and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. Shifting bounds still hold for arbitrary sequences of examples and also for arbitrary partitions. The algorithm does not know the offline partition and the sequence of predictors that its performance is compared against. Naturally shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of a finite sets of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique consists of the following. At the end of each trial we project the hypothesis of the static algorithm into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved and the static bounds can be converted to shifting bounds so that the cost for shifting remains reasonable. *The authors were supported by the NSF grant CCR-9700201 Permission to make digital or h,a.rd copies of all or part oftbis work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. COLT 98 Madison WI 1JSA Copyright ACM 1998 I-581 13-057--0/9X/ 7...$5.00

[1]  N. Littlestone Mistake bounds and logarithmic linear-threshold learning algorithms , 1990 .

[2]  N. Littlestone Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm , 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).

[3]  丸山 徹 Convex Analysisの二,三の進展について , 1977 .

[4]  I. Csiszár Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems , 1991 .

[5]  Manfred K. Warmuth,et al.  Exponentiated Gradient Versus Gradient Descent for Linear Predictors , 1997, Inf. Comput..

[6]  Philip M. Long,et al.  WORST-CASE QUADRATIC LOSS BOUNDS FOR ON-LINE PREDICTION OF LINEAR FUNCTIONS BY GRADIENT DESCENT , 1993 .

[7]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[8]  Manfred K. Warmuth,et al.  Additive versus exponentiated gradient updates for linear prediction , 1995, STOC '95.

[9]  Vladimir Vovk,et al.  Aggregating strategies , 1990, COLT '90.

[10]  L. Bregman The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[11]  Tom Bylander,et al.  The binary exponentiated gradient algorithm for learning linear functions , 1997, COLT '97.

[12]  Dale Schuurmans,et al.  General Convergence Results for Linear Discriminant Updates , 1997, COLT '97.

[13]  T. Cover Universal Portfolios , 1996 .

[14]  Charles L. Byrne,et al.  General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis , 1990, IEEE Trans. Inf. Theory.

[15]  Vladimir Vovk,et al.  A game of prediction with expert advice , 1995, COLT '95.

[16]  Manfred K. Warmuth,et al.  The Weighted Majority Algorithm , 1994, Inf. Comput..

[17]  Yoram Singer,et al.  On‐Line Portfolio Selection Using Multiplicative Updates , 1998, ICML.

[18]  Vladimir Vovk,et al.  Derandomizing Stochastic Prediction Strategies , 1997, COLT '97.

[19]  R. Lathe Phd by thesis , 1988, Nature.

[20]  Manfred K. Warmuth,et al.  Worst-case loss bounds for sigmoided linear neurons , 1995, NIPS 1995.

[21]  Philip M. Long,et al.  Worst-case quadratic loss bounds for prediction using linear functions and gradient descent , 1996, IEEE Trans. Neural Networks.

[22]  David Haussler,et al.  Tight worst-case loss bounds for predicting with expert advice , 1994, EuroCOLT.

[23]  Manfred K. Warmuth,et al.  Using experts for predicting continuous outcomes , 1994, European Conference on Computational Learning Theory.

[24]  Darrell D. E. Long,et al.  A dynamic disk spin-down technique for mobile computing , 1996, MobiCom '96.

[25]  Y. Censor,et al.  An iterative row-action method for interval convex programming , 1981 .