Paper information - Tracking the best regressor

Tracking the best regressor

In most on-line learning research, the total on-line loss of the algorithm is compared to the total loss of the best off-line predictor u from a comparison class of predictors. We call such bounds static bounds. The interesting feature of these bounds is that they hold for an arbitrary sequence of examples. Recently some work has been done in which the comparison vector u_t at each trial t is allowed to change with time, and the total on-line loss of the algorithm is compared to the sum of the losses of u_t at each trial plus the total "cost" for shifting to successive comparison vectors. This models situations in which the examples change over time and different predictors from the comparison class are best for different segments of the sequence of examples. We call such bounds shifting bounds. Shifting bounds still hold for arbitrary sequences of examples and also for arbitrary partitions. The algorithm knows neither the off-line partition nor the sequence of predictors against which its performance is compared. Naturally, shifting bounds are much harder to prove. The only known bounds are for the case when the comparison class consists of a finite set of experts or boolean disjunctions. In this paper we develop the methodology for lifting known static bounds to the shifting case. In particular, we obtain bounds when the comparison class consists of linear neurons (linear combinations of experts). Our essential technique is the following: at the end of each trial, we project the hypothesis of the static algorithm into a suitably chosen convex region. This keeps the hypothesis of the algorithm well-behaved, and the static bounds can be converted to shifting bounds so that the cost for shifting remains reasonable.
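The projection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which static algorithm or which convex region the paper uses, so everything concrete here is an assumption made for illustration: gradient descent on the square loss as the static algorithm, an L2 ball of radius `radius` as the convex region, and the hypothetical names `project_l2_ball` and `projected_gd`. It should not be read as the authors' algorithm.

```python
import math


def project_l2_ball(w, radius):
    """Euclidean projection of w onto the ball {v : ||v||_2 <= radius}.

    The L2 ball is an assumed stand-in for the paper's "suitably chosen
    convex region"; the abstract does not specify the region.
    """
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= radius:
        return list(w)
    scale = radius / norm
    return [scale * x for x in w]


def projected_gd(examples, eta, radius):
    """On-line gradient descent on the square loss, projecting after each trial.

    examples: list of (x, y) pairs, x a list of floats, y a float.
    Returns (final weight vector, total on-line square loss).
    """
    n = len(examples[0][0])
    w = [0.0] * n
    total_loss = 0.0
    for x, y in examples:
        # Predict with the current hypothesis and suffer the loss.
        y_hat = sum(wi * xi for wi, xi in zip(w, x))
        total_loss += (y_hat - y) ** 2
        # Static update: one gradient step on (y_hat - y)^2.
        w = [wi - eta * 2.0 * (y_hat - y) * xi for wi, xi in zip(w, x)]
        # End of trial: project the hypothesis back into the convex region.
        w = project_l2_ball(w, radius)
    return w, total_loss


# A toy sequence whose best comparison vector shifts from (1, 0) to (0, 1).
segment_a = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)] * 20
segment_b = [([1.0, 0.0], 0.0), ([0.0, 1.0], 1.0)] * 20
w, loss = projected_gd(segment_a + segment_b, eta=0.1, radius=1.0)
print(w, loss)
```

In this toy run the weight vector first moves toward (1, 0) and then, after the shift, toward (0, 1). The projection keeps the hypothesis within a bounded region throughout, which is the kind of well-behavedness the abstract relies on to keep the shifting cost reasonable.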
*The authors were supported by NSF grant CCR-9700201.
COLT 98, Madison, WI, USA. Copyright ACM 1998.

Mark Herbster | Manfred K. Warmuth
