A new Q(λ) with interim forward view and Monte Carlo equivalence

Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and off-policy LSTD(λ), the λ parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of λ, and as a result they fail to approximate Monte Carlo learning when λ = 1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be 'un-made' upon deviation from the target policy? In this paper, we introduce a new version of Q(λ) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(λ), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).

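To make the bootstrapping-on-exploration problem concrete, the following is a minimal tabular sketch (in Python) of the conventional Watkins-style Q(λ) update with accumulating traces. The function name watkins_q_lambda_step, the array shapes, and the default step-size and discount values are illustrative assumptions, not from the paper; the sketch shows the baseline behaviour that PQ(λ) is designed to correct, not the new algorithm itself.

import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next,
                          alpha=0.1, gamma=0.99, lam=0.9):
    """One accumulating-trace update of Watkins-style tabular Q(lambda).

    Q, e : arrays of shape (n_states, n_actions); e holds eligibility traces.
    (s, a, r, s_next, a_next) : one transition plus the action the behaviour
    policy actually takes in s_next. Sizes and names are illustrative.
    """
    greedy_value = np.max(Q[s_next])
    # Off-policy target: bootstrap from the greedy action's value in s_next.
    delta = r + gamma * greedy_value - Q[s, a]

    e[s, a] += 1.0                      # accumulating eligibility trace
    Q += alpha * delta * e              # update every traced state-action pair

    if Q[s_next, a_next] == greedy_value:
        e *= gamma * lam                # next action is greedy: decay traces
    else:
        e[:] = 0.0                      # exploratory action: cut all traces,
                                        # i.e. bootstrap here regardless of lambda
    return Q, e

# Example setup (illustrative sizes):
# Q = np.zeros((n_states, n_actions)); e = np.zeros_like(Q)

The final branch is the crux: with λ = 1 and a purely greedy trajectory, the accumulated updates approximate a Monte Carlo return, but the first exploratory action zeroes every trace and forces a bootstrap at that step, which is exactly the failure of Monte Carlo equivalence described in the abstract.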