How to use expert advice

We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called "experts". Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also extend our analysis to the case in which log loss is used instead of the expected number of mistakes.
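As a concrete illustration of the setting, the sketch below implements an exponentially weighted, randomized forecaster in the style of the Weighted Majority Algorithm and aggregating strategies. It is not the paper's algorithm; the function name exp_weights_predict and the fixed learning rate eta are assumptions for illustration, whereas the paper's algorithms tune the rate using (bounds on) the loss of the best expert. The function returns the algorithm's expected number of mistakes together with the best expert's mistakes, so their difference is the regret quantity the paper bounds.

```python
import math

def exp_weights_predict(expert_preds, outcomes, eta=0.5):
    """Randomized prediction of a binary sequence from expert advice.

    expert_preds: list of rounds, each a list of the N experts' {0,1} predictions.
    outcomes:     list of the true bits in {0,1}, one per round.
    eta:          learning rate; fixed here for simplicity.

    Returns (expected number of mistakes of the algorithm,
             number of mistakes of the best expert in hindsight).
    """
    n = len(expert_preds[0])
    weights = [1.0] * n          # one weight per expert, initially uniform
    alg_loss = 0.0               # expected mistakes of the randomized predictor
    expert_loss = [0] * n        # cumulative mistakes of each expert

    for preds, y in zip(expert_preds, outcomes):
        total = sum(weights)
        # Predict 1 with probability equal to the weighted fraction of experts saying 1.
        p_one = sum(w for w, p in zip(weights, preds) if p == 1) / total
        # Expected mistake on this bit, over the algorithm's internal randomization.
        alg_loss += p_one if y == 0 else 1.0 - p_one
        # Exponentially down-weight every expert that predicted incorrectly.
        for i, p in enumerate(preds):
            if p != y:
                weights[i] *= math.exp(-eta)
                expert_loss[i] += 1

    return alg_loss, min(expert_loss)


# Toy usage: two experts, one always right and one always wrong on an all-ones sequence.
rounds = [[1, 0]] * 20
bits = [1] * 20
print(exp_weights_predict(rounds, bits))
```

On this toy run the best expert makes no mistakes while the algorithm accrues only a small expected number of mistakes before the wrong expert's weight becomes negligible, which is the flavor of the square-root regret bound stated above (with the caveat that a properly tuned learning rate is needed to obtain the paper's constants).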
