Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.

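To make the setting concrete, below is a minimal sketch of synchronous tabular TD(0) with Polyak-Ruppert iterate averaging under a generative model, the baseline algorithm whose non-asymptotic suboptimality the abstract refers to. It assumes a tabular Markov reward process given by a transition matrix `P` and expected-reward vector `r`; the function name `td0_policy_evaluation` and the polynomial step-size schedule are illustrative choices, not the paper's prescription.

```python
import numpy as np

def td0_policy_evaluation(P, r, gamma, num_iters, seed=0):
    """Synchronous tabular TD(0) under a generative model, with Polyak-Ruppert averaging.

    P : (S, S) transition matrix of the Markov reward process induced by the policy
    r : (S,) expected reward vector
    gamma : discount factor in [0, 1)
    Returns the final iterate and the Polyak-Ruppert averaged iterate.
    """
    rng = np.random.default_rng(seed)
    S = len(r)
    theta = np.zeros(S)        # current value-function estimate
    theta_avg = np.zeros(S)    # running average of the iterates
    for t in range(1, num_iters + 1):
        # Generative model: draw one next-state sample for every state.
        next_states = np.array([rng.choice(S, p=P[s]) for s in range(S)])
        # Noisy Bellman backup and TD(0) update with a polynomial step size
        # (one common choice in this literature; other schedules are possible).
        alpha = 1.0 / (1.0 + (1.0 - gamma) * t)
        td_target = r + gamma * theta[next_states]
        theta = (1.0 - alpha) * theta + alpha * td_target
        # Incremental Polyak-Ruppert average of theta_1, ..., theta_t.
        theta_avg += (theta - theta_avg) / t
    return theta, theta_avg
```

The variance-reduced schemes analyzed in the paper additionally recenter each stochastic update around a Bellman backup estimated from a separate batch of samples; that refinement is omitted from this sketch.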