Q-learning and enhanced policy iteration in discounted dynamic programming

We consider the classical finite-state discounted Markovian decision problem, and we introduce a new policy iteration-like algorithm for finding the optimal Q-factors. Instead of policy evaluation by solving a linear system of equations, our algorithm involves (possibly inexact) solution of an optimal stopping problem. This problem can be solved with simple Q-learning iterations when a lookup table representation is used, or with the Q-learning algorithm of Tsitsiklis and Van Roy [TsV99] when feature-based Q-factor approximations are used. In its exact/lookup table form, our algorithm admits asynchronous and stochastic iterative implementations, in the spirit of asynchronous/modified policy iteration, with lower overhead than existing Q-learning schemes. Furthermore, for large-scale problems, where linear basis function approximations and simulation-based temporal difference implementations are used, our algorithm effectively resolves the difficulties that existing schemes face due to inadequate exploration.
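
Below is a minimal lookup-table sketch of the idea described above, included here for illustration only: policy evaluation is replaced by (approximate) solution of an optimal stopping problem, itself handled with simple Q-learning-style iterations. The MDP data (P, g), the discount factor alpha, the stepsize schedule, and the iteration counts are illustrative assumptions, not the paper's specification.

```python
# Illustrative sketch (not the authors' exact algorithm): policy iteration in
# which each policy evaluation solves an optimal stopping problem by Q-learning.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
alpha = 0.9  # discount factor (assumed)

# Randomly generated MDP: transition probabilities P[i, u, j] and costs g[i, u, j].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n_states, n_actions, n_states))

Q = np.zeros((n_states, n_actions))

for outer in range(100):                 # policy improvement loop
    J = Q.min(axis=1)                    # "stopping" cost: J(j) = min_v Q(j, v)
    mu = Q.argmin(axis=1)                # greedy policy used when "continuing"

    # Inexact policy evaluation: Q-learning sweeps for the stopping problem
    #   Q(i, u) = sum_j P[i, u, j] * ( g(i, u, j) + alpha * min(J(j), Q(j, mu(j))) ).
    for sweep in range(20):
        step = 1.0 / (sweep + 2)         # diminishing stepsize (assumed schedule)
        for i in range(n_states):
            for u in range(n_actions):
                j = rng.choice(n_states, p=P[i, u])   # simulated next state
                target = g[i, u, j] + alpha * min(J[j], Q[j, mu[j]])
                Q[i, u] = (1 - step) * Q[i, u] + step * target

print("Q-factors:\n", Q)
print("greedy policy:", Q.argmin(axis=1))
```

In a feature-based setting, the same stopping problem would instead be addressed with the Tsitsiklis-Van Roy Q-learning algorithm cited above, with Q represented as a linear combination of basis functions.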

[1] Ronald A. Howard, et al., Dynamic Programming and Markov Processes, 1960.

[2] Dimitri P. Bertsekas, et al., Dynamic Programming and Optimal Control, Vol. II, 1976.

[3] Richard E. Rosenthal, et al., Stochastic Dynamic Location Analysis, 1978.

[4] Dimitri P. Bertsekas, et al., Distributed asynchronous computation of fixed points, 1983, Math. Program.

[5] John N. Tsitsiklis, et al., Distributed Asynchronous Deterministic and Stochastic Gradient Optimization Algorithms, 1984, American Control Conference.

[6] C. Watkins, Learning from delayed rewards, 1989.

[7] John N. Tsitsiklis, et al., Parallel and distributed computation, 1989.

[8] John N. Tsitsiklis, et al., Asynchronous stochastic approximation and Q-learning, 1993, Proceedings of the 32nd IEEE Conference on Decision and Control.

[9] Ronald J. Williams, et al., Analysis of Some Incremental Variants of Policy Iteration: First Steps Toward Understanding Actor-Critic Learning Systems, 1993.

[10] Michael I. Jordan, et al., Massachusetts Institute of Technology Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Department of Brain and Cognitive Sciences, 1996.

[11] Martin L. Puterman, et al., Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[12] Michael I. Jordan, et al., Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems, 1994, NIPS.

[13] Geoffrey J. Gordon, Stable Function Approximation in Dynamic Programming, 1995, ICML.

[14] Dimitri P. Bertsekas, et al., Dynamic Programming and Optimal Control, Two Volume Set, 1995.

[15] John N. Tsitsiklis, et al., Neuro-Dynamic Programming, 1996, Encyclopedia of Machine Learning.

[16] S. Ioffe, et al., Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, 1996.

[17] Dimitri P. Bertsekas, et al., Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming, 1997.

[18] Vivek S. Borkar, et al., Stochastic Approximation for Nonexpansive Maps: Application to Q-Learning Algorithms, 1997, SIAM J. Control. Optim.

[19] V. Borkar, Asynchronous Stochastic Approximations, 1998.

[20] John N. Tsitsiklis, et al., Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives, 1999, IEEE Trans. Autom. Control.

[21] John S. Baras, et al., A learning algorithm for Markov decision processes with adaptive state aggregation, 2000, Proceedings of the 39th IEEE Conference on Decision and Control.

[22] Sanjoy Dasgupta, et al., Off-Policy Temporal Difference Learning with Function Approximation, 2001, ICML.

[23] Ying He, et al., Simulation-Based Algorithms for Markov Decision Processes, 2002.

[24] John N. Tsitsiklis, et al., On the Convergence of Optimistic Policy Iteration, 2002, J. Mach. Learn. Res.

[25] Abhijit Gosavi, et al., Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2003.

[26] Abhijit Gosavi, et al., Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, 2003.

[27] Steven J. Bradtke, et al., Linear Least-Squares algorithms for temporal difference learning, 2004, Machine Learning.

[28] John N. Tsitsiklis, et al., Feature-based methods for large scale dynamic programming, 2004, Machine Learning.

[29] Justin A. Boyan, et al., Technical Update: Least-Squares Temporal Difference Learning, 2002, Machine Learning.

[30] Shie Mannor, et al., Basis Function Adaptation in Temporal Difference Reinforcement Learning, 2005, Ann. Oper. Res.

[31] Richard S. Sutton, et al., Learning to predict by the methods of temporal differences, 1988, Machine Learning.

[32] David Choi, et al., A Generalized Kalman Filter for Fixed Point Approximation and Efficient Temporal-Difference Learning, 2001, Discret. Event Dyn. Syst.

[33] Sean P. Meyn, Control Techniques for Complex Networks: Workload, 2007.

[34] Xi-Ren Cao, et al., Stochastic learning and optimization - A sensitivity-based approach, 2007, Annu. Rev. Control.

[35] Jiaqiao Hu, et al., Simulation-based Algorithms for Markov Decision Processes (Communications and Control Engineering), 2007.

[36] T. Jung, et al., Kernelizing LSPE(λ), 2007, IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.

[37] Xi-Ren Cao, et al., Stochastic Learning and Optimization: A Sensitivity-Based Approach (International Series on Discrete Event Dynamic Systems), 2007.

[38] Warren B. Powell, et al., Approximate Dynamic Programming - Solving the Curses of Dimensionality, 2007.

[39] D. Bertsekas, et al., A Least Squares Q-Learning Algorithm for Optimal Stopping Problems, 2007.

[40] Warren B. Powell, et al., Approximate Dynamic Programming: Solving the Curses of Dimensionality (Wiley Series in Probability and Statistics), 2007.

[41] D. Bertsekas, et al., Q-learning algorithms for optimal stopping based on least squares, 2007, European Control Conference (ECC).

[42] R. Sutton, et al., A convergent O(n) algorithm for off-policy temporal-difference learning with linear function approximation, 2008, NIPS.

[43] Richard S. Sutton, et al., A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation, 2008, NIPS.

[44] Shalabh Bhatnagar, et al., New algorithms of the Q-learning type, 2008, Autom.

[45] V. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint, 2008, Texts and Readings in Mathematics.

[46] Panos M. Pardalos, et al., Approximate dynamic programming: solving the curses of dimensionality, 2009, Optim. Methods Softw.

[47] Dimitri P. Bertsekas, et al., Basis function adaptation methods for cost approximation in MDP, 2009, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.

[48] Shalabh Bhatnagar, et al., Fast gradient-descent methods for temporal-difference learning with linear function approximation, 2009, ICML '09.

[49] Xi-Ren Cao, et al., Stochastic learning and optimization - A sensitivity-based approach, 2007, Annual Reviews in Control.

[50] Shalabh Bhatnagar, et al., Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, 2009, NIPS.

[51] D. Bertsekas, et al., Projected Equation Methods for Approximate Solution of Large Linear Systems, 2009, Journal of Computational and Applied Mathematics.

[52] Dimitri P. Bertsekas, et al., Distributed asynchronous policy iteration in dynamic programming, 2010, 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[53] Dimitri P. Bertsekas, et al., Q-learning and enhanced policy iteration in discounted dynamic programming, 2010, CDC.

[54] Bart De Schutter, et al., Online least-squares policy iteration for reinforcement learning control, 2010, Proceedings of the 2010 American Control Conference.

[55] Huizhen Yu, et al., Convergence of Least Squares Temporal Difference Methods Under General Conditions, 2010, ICML.

[56] Bart De Schutter, et al., Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.

[57] Dimitri P. Bertsekas, et al., Approximate Dynamic Programming, 2017, Encyclopedia of Machine Learning and Data Mining.

[58] Benjamin Van Roy, On Regression-Based Stopping Times, 2010, Discret. Event Dyn. Syst.

[59] Dimitri P. Bertsekas, Williams-Baird Counterexample for Q-Factor Asynchronous Policy Iteration, 2010.

[60] D. Bertsekas, Approximate policy iteration: a survey and some new methods, 2011.

[61] Huizhen Yu, et al., Least Squares Temporal Difference Methods: An Analysis under General Conditions, 2012, SIAM J. Control. Optim.

[62] Richard S. Sutton, et al., Reinforcement Learning, 1992, Handbook of Machine Learning.

[63] J. Walrand, et al., Distributed Dynamic Programming, 2022.