On Generalized Bellman Equations and Temporal-Difference Learning

We consider off-policy temporal-difference (TD) learning in discounted Markov decision processes, where the goal is to evaluate a policy in a model-free way using observations of a state process generated without executing the policy. To curb the high-variance issue in off-policy TD learning, we propose a new scheme for setting the \(\lambda \) parameters of TD, based on generalized Bellman equations. Our scheme sets \(\lambda \) according to the eligibility-trace iterates calculated in TD, thereby easily keeping these traces in a desired bounded range. Compared to prior work, this scheme is more direct and flexible, and allows much larger \(\lambda \) values for off-policy TD learning with bounded traces. Using Markov chain theory, we prove the ergodicity of the joint state-trace process under nonrestrictive conditions, and we show that associated with our scheme is a generalized Bellman equation (for the policy to be evaluated) that depends on both \(\lambda \) and the unique invariant probability measure of the state-trace process. These results not only lead immediately to a characterization of the convergence behavior of least-squares-based implementations of our scheme, but also prepare the ground for further analysis of gradient-based implementations.
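The abstract describes the \(\lambda \)-setting scheme only at a high level. The sketch below is a minimal illustration, under assumptions, of how a trace-dependent choice of \(\lambda_t \) can keep the eligibility traces bounded in off-policy TD(\(\lambda \)) with linear function approximation. The particular rule \(\lambda_t = \min\{1, K/\|\gamma \rho_{t-1} e_{t-1}\|\}\), the bound parameter `trace_bound`, and the function name are hypothetical choices for illustration, not the authors' exact algorithm; the placement of the importance-sampling ratios follows a standard off-policy TD(\(\lambda \)) formulation rather than any specific variant analyzed in the paper.

```python
import numpy as np

def off_policy_td_bounded_trace(phi, rewards, rho, gamma=0.9, alpha=0.01, trace_bound=10.0):
    """Off-policy TD(lambda) with linear function approximation, where lambda_t
    is chosen from the current eligibility-trace iterate so that the carried
    trace never exceeds `trace_bound` (a hypothetical tuning parameter).

    phi     : array of shape (T+1, d), feature vectors phi(S_0), ..., phi(S_T)
    rewards : array of shape (T,), rewards R_1, ..., R_T
    rho     : array of shape (T,), importance-sampling ratios pi(A_t|S_t)/mu(A_t|S_t)
    """
    T, d = len(rewards), phi.shape[1]
    theta = np.zeros(d)   # weight vector of the linear value estimate
    e = np.zeros(d)       # eligibility trace

    for t in range(T):
        # Trace that would be carried over from the previous step with lambda_t = 1.
        carried = gamma * rho[t - 1] * e if t > 0 else np.zeros(d)
        norm = np.linalg.norm(carried)
        # Trace-dependent lambda: use lambda_t = 1 unless the carried trace
        # would exceed the bound, in which case shrink it just enough.
        lam = 1.0 if norm <= trace_bound else trace_bound / norm
        e = lam * carried + phi[t]

        # One-step TD error for the value estimate of the target policy.
        delta = rewards[t] + gamma * theta @ phi[t + 1] - theta @ phi[t]
        # Off-policy correction of the update with the ratio rho_t.
        theta = theta + alpha * rho[t] * delta * e

    return theta
```

By construction, \(\|e_t\| \le K + \max_t \|\phi(S_t)\|\) in this sketch, which is the kind of boundedness the trace-dependent \(\lambda \) scheme is designed to guarantee even when the importance-sampling ratios are large.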
