How to use expert advice

We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.
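To make the setting concrete, the following is a minimal Python sketch of a randomized weighted-majority style predictor of the general kind analyzed here: each expert's weight is demoted when it errs, and the algorithm predicts by randomizing over the weighted expert votes. This is an illustration, not the paper's exact algorithm; the function name, the fixed demotion factor beta = 0.8, and the toy data are all illustrative assumptions (in particular, the paper tunes this parameter to obtain the square-root regret bound).

```python
import random

def randomized_weighted_majority(expert_preds, outcomes, beta=0.8):
    """Predict a binary sequence by randomly following weighted experts.

    expert_preds: list of T rounds, each a list of one 0/1 prediction per expert.
    outcomes:     list of T true bits.
    beta:         multiplier in (0, 1) applied to mistaken experts' weights;
                  0.8 is an illustrative choice, not the paper's tuned value.
    Returns the expected number of mistakes of the randomized predictor.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    expected_mistakes = 0.0
    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        # Fraction of weight on experts currently predicting 1.
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / total
        # Predict 1 with probability p_one, so the chance of a mistake
        # is exactly the weight fraction on the wrong side.
        expected_mistakes += p_one if y == 0 else 1.0 - p_one
        # Demote every expert that erred on this round.
        weights = [w * beta if p != y else w for w, p in zip(weights, preds)]
    return expected_mistakes

if __name__ == "__main__":
    random.seed(0)
    T, n = 200, 5
    outcomes = [random.randint(0, 1) for _ in range(T)]
    # Expert 0 is nearly perfect; the remaining experts guess at random.
    expert_preds = [[y if random.random() < 0.95 else 1 - y]
                    + [random.randint(0, 1) for _ in range(n - 1)]
                    for y in outcomes]
    print(randomized_weighted_majority(expert_preds, outcomes))
```

On such data the expected mistake count stays close to that of the near-perfect expert, consistent with the regret growing only on the order of the square root of the best expert's mistakes (times a log factor in the number of experts).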
