Relative Loss Bounds for Temporal-Difference Learning

Foster and Vovk proved relative loss bounds for linear regression in which the total loss of the on-line algorithm minus the total loss of the best linear predictor (chosen in hindsight) grows logarithmically with the number of trials. We give similar bounds for temporal-difference learning. Learning takes place in a sequence of trials where the learner tries to predict discounted sums of future reinforcement signals. The quality of the predictions is measured with the square loss, and we bound the total loss of the on-line algorithm minus the total loss of the best linear predictor for the whole sequence of trials. Again, the difference of the losses is logarithmic in the number of trials. The bounds hold for an arbitrary (worst-case) sequence of examples. We also give a bound on the expected difference for the case when the instances are drawn from an unknown distribution. For linear regression, a corresponding lower bound shows that this expected bound cannot be improved substantially.
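The learning setting in the abstract can be sketched in code. The following is a minimal illustrative example, not the paper's algorithm: an on-line linear learner that, on each trial, predicts the discounted sum of future reinforcements r_t + γ·r_{t+1} + γ²·r_{t+2} + … as an inner product w·x_t, and updates via the standard temporal-difference error. The function name `td_linear_predict` and the step size `eta` are hypothetical choices for this sketch.

```python
import numpy as np

def td_linear_predict(X, r, gamma=0.9, eta=0.05):
    """On-line linear TD(0)-style learner.

    X : (T, d) array of instance vectors, one per trial.
    r : (T,) array of reinforcement signals.
    At trial t the prediction w . x_t estimates the discounted return
    r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    Returns the final weight vector and the per-trial predictions.
    """
    T, d = X.shape
    w = np.zeros(d)
    preds = np.zeros(T)
    for t in range(T):
        preds[t] = w @ X[t]
        # Bootstrapped target: the next prediction stands in for the
        # unobserved tail of the discounted return.
        nxt = w @ X[t + 1] if t + 1 < T else 0.0
        target = r[t] + gamma * nxt
        # Gradient-descent step on the square loss against the TD target.
        w += eta * (target - preds[t]) * X[t]
    return w, preds
```

In the relative-loss framework of the abstract, the cumulative square loss of such an on-line learner would be compared, after the fact, to that of the single best weight vector for the whole trial sequence.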
