Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as "the deadly triad"). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and $n$-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad.
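To make the bootstrapping structure concrete, here is a minimal sketch of a tabular fixed-horizon TD(0) prediction loop. It is not the paper's implementation; it assumes a discrete-state environment with a Gym-style `reset`/`step` interface, and the names (`env`, `n_states`, `behavior_policy`, `H`, `alpha`) are illustrative. The key point it shows is that the horizon-$h$ estimate is updated toward a target built from the horizon-$(h-1)$ estimate, so no value function ever bootstraps from itself.

```python
import numpy as np

def fixed_horizon_td0(env, n_states, behavior_policy, H=16, alpha=0.1,
                      gamma=1.0, episodes=500):
    """Tabular fixed-horizon TD(0) sketch (illustrative, not the paper's code)."""
    # V[h][s] estimates the expected return over exactly h future steps.
    # V[0] is identically zero and is never updated.
    V = np.zeros((H + 1, n_states))

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = behavior_policy(s)
            s_next, r, done, _ = env.step(a)
            for h in range(1, H + 1):
                # Bootstrap from the (h-1)-horizon value of the next state,
                # never from V[h] itself.
                target = r if done else r + gamma * V[h - 1][s_next]
                V[h][s] += alpha * (target - V[h][s])
            s = s_next

    # V[H] approximates the H-step truncated return from each state.
    return V
```

In practice the per-step loop over horizons can be vectorized (a single parallel update of all $H$ tables or heads), and with function approximation the horizons can share weights, which is how the paper argues the added complexity stays manageable.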
