Approximate policy iteration: a survey and some new methods

We consider the classical policy iteration method of dynamic programming (DP), where approximations and simulation are used to deal with the curse of dimensionality. We survey a number of issues: convergence and rate of convergence of approximate policy evaluation methods, singularity and susceptibility to simulation noise of policy evaluation, exploration issues, constrained and enhanced policy iteration, policy oscillation and chattering, and optimistic and distributed policy iteration. Our discussion of policy evaluation is couched in general terms and aims to unify the available methods in the light of recent research developments, and to compare the two main policy evaluation approaches: projected equations and temporal differences (TD), and aggregation. In the context of these approaches, we survey two different types of simulation-based algorithms: matrix inversion methods, such as least-squares temporal difference (LSTD), and iterative methods, such as least-squares policy evaluation (LSPE) and TD(λ), together with their scaled variants. We discuss a recent method, based on regression and regularization, which rectifies the unreliability of LSTD for nearly singular projected Bellman equations. An iterative version of this method belongs to the LSPE class and provides the connecting link between LSTD and LSPE. Our discussion of policy improvement focuses on the role of policy oscillation and its effect on performance guarantees. We illustrate that policy evaluation by the projected equation/TD approach may lead to policy oscillation, whereas policy evaluation by aggregation does not. This implies better error bounds and more regular performance for aggregation, at the expense of some loss of generality in cost function representation capability. Hard aggregation provides the connecting link between projected equation/TD-based and aggregation-based policy evaluation, and is characterized by favorable error bounds.
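To make the terminology concrete, the following is a minimal sketch, in the standard notation of this literature (the symbols Φ, Π, Ξ, α, P, g, γ, and G_k are the conventional ones, supplied here for illustration rather than quoted from the paper), of the projected Bellman equation and of the two solution styles the abstract contrasts; the multistep TD(λ) versions replace T_μ below with a λ-weighted multistep mapping.

    \[
    \Phi r^{*} \;=\; \Pi\, T_{\mu}(\Phi r^{*}),
    \qquad T_{\mu} J \;=\; g + \alpha P J,
    \]
    \[
    \text{equivalently}\quad C r^{*} = d,
    \qquad C = \Phi^{\top}\Xi\,(I - \alpha P)\,\Phi,
    \qquad d = \Phi^{\top}\Xi\, g,
    \]
    \[
    \text{LSTD:}\quad r_{k} = C_{k}^{-1} d_{k},
    \qquad
    \text{scaled LSPE-type iteration:}\quad r_{k+1} = r_{k} - \gamma\, G_{k}\,(C_{k} r_{k} - d_{k}),
    \]

where Φ is the feature matrix, Π is the projection onto its range with respect to the Ξ-weighted Euclidean norm, Ξ is the diagonal matrix of steady-state probabilities of the evaluated policy μ, α is the discount factor, P and g are the transition matrix and one-stage cost of μ, C_k and d_k are simulation-based estimates of C and d, γ is a stepsize, and G_k is a positive definite scaling matrix. Choosing G_k close to C_k^{-1} recovers LSTD in a single iteration, which is one way to see the connecting link between the LSTD and LSPE classes noted in the abstract.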
