Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis

We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model. We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms. Theory-inspired simulations show that the widely-used temporal difference (TD) algorithm is strictly suboptimal when evaluated in a non-asymptotic setting, even when combined with Polyak-Ruppert iterate averaging. We remedy this issue by introducing and analyzing variance-reduced forms of stochastic approximation, showing that they achieve non-asymptotic, instance-dependent optimality up to logarithmic factors.

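To make the setting concrete, below is a minimal sketch of synchronous tabular TD(0) with Polyak-Ruppert iterate averaging under a generative model, the baseline algorithm whose non-asymptotic suboptimality the abstract refers to. It assumes a tabular Markov reward process given by a transition matrix `P` and expected-reward vector `r`; the function name `td0_policy_evaluation` and the polynomial step-size schedule are illustrative choices, not the paper's prescription.

```python
import numpy as np

def td0_policy_evaluation(P, r, gamma, num_iters, seed=0):
    """Synchronous tabular TD(0) under a generative model, with Polyak-Ruppert averaging.

    P : (S, S) transition matrix of the Markov reward process induced by the policy
    r : (S,) expected reward vector
    gamma : discount factor in [0, 1)
    Returns the final iterate and the Polyak-Ruppert averaged iterate.
    """
    rng = np.random.default_rng(seed)
    S = len(r)
    theta = np.zeros(S)        # current value-function estimate
    theta_avg = np.zeros(S)    # running average of the iterates
    for t in range(1, num_iters + 1):
        # Generative model: draw one next-state sample for every state.
        next_states = np.array([rng.choice(S, p=P[s]) for s in range(S)])
        # Noisy Bellman backup and TD(0) update with a polynomial step size
        # (one common choice in this literature; other schedules are possible).
        alpha = 1.0 / (1.0 + (1.0 - gamma) * t)
        td_target = r + gamma * theta[next_states]
        theta = (1.0 - alpha) * theta + alpha * td_target
        # Incremental Polyak-Ruppert average of theta_1, ..., theta_t.
        theta_avg += (theta - theta_avg) / t
    return theta, theta_avg
```

The variance-reduced schemes analyzed in the paper additionally recenter each stochastic update around a Bellman backup estimated from a separate batch of samples; that refinement is omitted from this sketch.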