Least-Squares Methods for Policy Iteration

Approximate reinforcement learning addresses the central problem of applying reinforcement learning in large and continuous state-action spaces by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core policy evaluation component of policy iteration: least-squares temporal difference, least-squares policy evaluation, and Bellman residual minimization. We introduce these techniques starting from their general mathematical principles and detail them down to fully specified algorithms. We pay particular attention to online variants of policy iteration, and provide a numerical example highlighting the behavior of representative offline and online methods. For the policy evaluation component, as well as for the overall resulting approximate policy iteration, we provide guarantees on the performance obtained asymptotically, as the number of samples processed and iterations executed grows to infinity. We also provide finite-sample results, which apply when a finite number of samples and iterations are considered. Finally, we outline several extensions and improvements to the techniques and methods reviewed.
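To make the least-squares policy evaluation step concrete, the following Python sketch illustrates one of the three techniques named above, LSTD-based Q-function evaluation inside an approximate policy iteration loop (in the style of LSPI). It assumes a linear parameterization of the Q-function via features phi(s, a), a discrete action set, and a batch of transition samples (s, a, r, s'); the toy problem, the regularization term, and names such as `lstd_q` and `greedy_policy` are illustrative assumptions, not taken from the chapter.

```python
# Minimal sketch: LSTD-Q policy evaluation + greedy policy improvement (LSPI-style).
import numpy as np

def lstd_q(samples, phi, policy, gamma, n_features, reg=1e-6):
    """Solve the projected Bellman equation A w = b for the Q-function weights.

    A = sum_i phi(s_i, a_i) (phi(s_i, a_i) - gamma * phi(s'_i, policy(s'_i)))^T
    b = sum_i phi(s_i, a_i) * r_i
    """
    A = reg * np.eye(n_features)       # small ridge term keeps A invertible
    b = np.zeros(n_features)
    for (s, a, r, s_next) in samples:
        phi_sa = phi(s, a)
        phi_next = phi(s_next, policy(s_next))
        A += np.outer(phi_sa, phi_sa - gamma * phi_next)
        b += phi_sa * r
    return np.linalg.solve(A, b)

def greedy_policy(w, phi, actions):
    """Policy improvement: act greedily with respect to the approximate Q-function."""
    return lambda s: max(actions, key=lambda a: phi(s, a) @ w)

# Usage on a toy 2-state, 2-action chain (purely illustrative):
actions = [0, 1]
def phi(s, a):                         # one-hot state-action features
    v = np.zeros(4); v[2 * s + a] = 1.0; return v

rng = np.random.default_rng(0)
samples = []
for _ in range(500):
    s, a = rng.integers(2), rng.integers(2)
    s_next = a                         # the action deterministically selects the next state
    r = 1.0 if s_next == 1 else 0.0    # reward for reaching state 1
    samples.append((s, a, r, s_next))

policy = lambda s: 0                   # start from an arbitrary policy
for _ in range(5):                     # approximate policy iteration
    w = lstd_q(samples, phi, policy, gamma=0.9, n_features=4)
    policy = greedy_policy(w, phi, actions)
print("learned weights:", w)
print("greedy actions:", [policy(s) for s in (0, 1)])
```

With one-hot (tabular) features the projected equation is exact, so the loop recovers the optimal policy (always take action 1); with richer feature sets the same structure applies, only the quality guarantees become those of approximate policy iteration discussed in the chapter.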
