Natural actor-critic algorithms

We present four new reinforcement learning algorithms based on actor-critic, natural-gradient and function-approximation ideas, and we provide their convergence proofs. Actor-critic reinforcement learning methods are online approximations to policy iteration in which the value-function parameters are estimated using temporal difference learning and the policy parameters are updated by stochastic gradient descent. Methods based on policy gradients in this way are of special interest because of their compatibility with function-approximation methods, which are needed to handle large or infinite state spaces. The use of temporal difference learning in this way is of special interest because in many applications it dramatically reduces the variance of the gradient estimates. The use of the natural gradient is of interest because it can produce better conditioned parameterizations and has been shown to further reduce variance in some cases. Our results extend prior two-timescale convergence results for actor-critic methods by Konda and Tsitsiklis by using temporal difference learning in the actor and by incorporating natural gradients. Our results extend prior empirical studies of natural actor-critic methods by Peters, Vijayakumar and Schaal by providing the first convergence proofs and the first fully incremental algorithms.
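To make the update structure concrete, the sketch below shows one incremental step of a natural actor-critic method of the kind described above, in the average-reward setting: the critic's value-function parameters are adjusted by temporal difference learning, a vector of compatible-feature weights tracks the natural-gradient estimate, and the policy parameters follow that estimate on a slower timescale. This is a minimal illustration under assumed linear features and a softmax policy, not the paper's exact algorithm; all names (nac_step, softmax_policy, the step sizes alpha, beta, xi) are assumptions made for the example.

```python
# Minimal sketch of one incremental natural actor-critic update
# (average-reward setting, linear critic, softmax policy).
# Illustrative only: names and step sizes are assumptions, not the paper's notation.
import numpy as np


def softmax_policy(theta, action_feats):
    """Action probabilities pi(a|s) for a softmax policy with
    state-action features action_feats of shape (num_actions, d)."""
    prefs = action_feats @ theta
    prefs -= prefs.max()          # subtract max for numerical stability
    expd = np.exp(prefs)
    return expd / expd.sum()


def nac_step(state_feat, next_state_feat, action_feats, action, reward,
             v, w, theta, avg_reward, alpha=0.05, beta=0.005, xi=0.05):
    """Process one transition (s, a, r, s') and return updated
    (v, w, theta, avg_reward)."""
    # Running estimate of the average reward.
    avg_reward = (1.0 - xi) * avg_reward + xi * reward

    # TD error of the differential value function with linear critic v.
    delta = reward - avg_reward + v @ next_state_feat - v @ state_feat

    # Critic: TD(0) update of the value-function parameters (fast timescale).
    v = v + alpha * delta * state_feat

    # Compatible features psi(s, a) = grad_theta log pi(a|s) for the softmax policy.
    probs = softmax_policy(theta, action_feats)
    psi = action_feats[action] - probs @ action_feats

    # w incrementally fits the TD error with the compatible features and
    # serves as the estimate of the natural-gradient direction.
    w = w + alpha * (delta - psi @ w) * psi

    # Actor: step the policy parameters along w (slow timescale, beta << alpha).
    theta = theta + beta * w
    return v, w, theta, avg_reward
```

The two-timescale structure, with the critic and the compatible-feature weights updated on the faster step size and the actor on the slower one, mirrors the kind of coupled updates whose convergence is analyzed in the paper.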

References

[1] John Rust, Numerical dynamic programming in economics, 1996.

[2]  Peter Dayan,et al.  Analytical Mean Squared Error Curves for Temporal Difference Learning , 1996, Machine Learning.

[3]  Peter W. Glynn,et al.  Likelihood ratio gradient estimation for stochastic systems , 1990, CACM.

[4]  Shalabh Bhatnagar,et al.  Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization , 2007, TOMC.

[5] V. Borkar, Recursive self-tuning control of finite Markov chains, 1997.

[6]  Michael I. Jordan,et al.  On the Convergence of Temporal-Difference Learning with Linear Function Approximation , 2001 .

[7]  Morris W. Hirsch,et al.  Convergent activation dynamics in continuous time networks , 1989, Neural Networks.

[8]  Stefan Schaal,et al.  Reinforcement Learning for Humanoid Robotics , 2003 .

[9]  Andrew W. Moore,et al.  Generalization in Reinforcement Learning: Safely Approximating the Value Function , 1994, NIPS.

[10]  S. Andradóttir,et al.  A Simulated Annealing Algorithm with Constant Temperature for Discrete Stochastic Optimization , 1999 .

[11]  Solomon Lefschetz,et al.  Stability by Liapunov's Direct Method With Applications , 1962 .

[12]  Shalabh Bhatnagar,et al.  Reinforcement Learning Based Algorithms for Average Cost Markov Decision Processes , 2007, Discret. Event Dyn. Syst..

[13] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009.

[14] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.

[15]  Jeff G. Schneider,et al.  Covariant Policy Search , 2003, IJCAI.

[16]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[17]  Samy Bengio,et al.  Variance Reduction Techniques in . . . , 2003 .

[18]  Odile Brandière,et al.  Some Pathological Traps for Stochastic Approximation , 1998 .

[19]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[20] Jonathan Baxter, KnightCap: A chess program that learns by combining TD(λ) with game-tree search, 1998.

[21]  Peter L. Bartlett,et al.  Infinite-Horizon Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[22]  Vivek S. Borkar,et al.  Learning Algorithms for Markov Decision Processes with Average Cost , 2001, SIAM J. Control. Optim..

[23]  Vladislav Tadic,et al.  On the Convergence of Temporal-Difference Learning with Linear Function Approximation , 2001, Machine Learning.

[24]  A. Barto,et al.  Improved Temporal Difference Methods with Linear Function Approximation , 2004 .

[25]  Shalabh Bhatnagar,et al.  A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes , 2004, IEEE Transactions on Automatic Control.

[26] J. Spall, Stochastic Optimization, 2002.

[27]  Sham M. Kakade,et al.  A Natural Policy Gradient , 2001, NIPS.

[28]  Xi-Ren Cao,et al.  Perturbation realization, potentials, and sensitivity analysis of Markov processes , 1997, IEEE Trans. Autom. Control..

[29] Vivek S. Borkar, et al. Actor-Critic-Type Learning Algorithms for Markov Decision Processes, 1999, SIAM J. Control. Optim.

[30]  Dirk Henkemans,et al.  C++ programming for the absolute beginner , 2001 .

[31]  Robert M. Glorioso,et al.  Engineering Cybernetics , 1975 .

[32]  John N. Tsitsiklis,et al.  Parallel and distributed computation , 1989 .

[33]  Benjamin Van Roy,et al.  Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[34]  Peter Stone,et al.  Policy gradient reinforcement learning for fast quadrupedal locomotion , 2004, IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA '04. 2004.

[35]  John N. Tsitsiklis,et al.  Average cost temporal-difference learning , 1997, Proceedings of the 36th IEEE Conference on Decision and Control.

[36]  Sean P. Meyn,et al.  The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[37]  Harold J. Kushner,et al.  Stochastic Approximation Algorithms and Applications , 1997, Applications of Mathematics.

[38]  John N. Tsitsiklis,et al.  Simulation-based optimization of Markov reward processes , 2001, IEEE Trans. Autom. Control..

[39]  D. J. White,et al.  A Survey of Applications of Markov Decision Processes , 1993 .

[40]  Sridhar Mahadevan,et al.  Hierarchical Policy Gradient Algorithms , 2003, ICML.

[41] R. Bellman, et al. Functional Approximations and Dynamic Programming, 1959.

[42] Abraham Thomas, et al. Learning Algorithms for Markov Decision Processes, 2009.

[43]  Shun-ichi Amari,et al.  Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.

[44]  Shie Mannor,et al.  Regularized Policy Iteration , 2008, NIPS.

[45] Andrew Tridgell, et al. KnightCap: A Chess Program That Learns by Combining TD(λ) with Game-Tree Search, 1998, ICML.

[46]  Gerald Tesauro,et al.  Temporal difference learning and TD-Gammon , 1995, CACM.

[47]  Shalabh Bhatnagar,et al.  Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization , 2005, TOMC.

[48]  Stefan Schaal,et al.  Policy Gradient Methods for Robotics , 2006, 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[49]  K. I. M. McKinnon,et al.  On the Generation of Markov Decision Processes , 1995 .

[50]  Leemon C. Baird,et al.  Residual Algorithms: Reinforcement Learning with Function Approximation , 1995, ICML.

[51]  Stefan Schaal,et al.  2008 Special Issue: Reinforcement learning of motor skills with policy gradients , 2008 .

[52] Shalabh Bhatnagar, et al. Natural actor-critic algorithms, 2009.

[53] Odile Brandière, Some Pathological Traps for Stochastic Approximation, 1998.

[54]  Peter L. Bartlett,et al.  Experiments with Infinite-Horizon, Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[55]  Michail G. Lagoudakis,et al.  Least-Squares Policy Iteration , 2003, J. Mach. Learn. Res..

[56]  Vivek S. Borkar,et al.  Reinforcement Learning — A Bridge Between Numerical Methods and Monte Carlo , 2009 .

[57]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[58]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[59] Harold J. Kushner, et al. Stochastic approximation methods for constrained and unconstrained systems, 1978.

[60]  Dimitri P. Bertsekas,et al.  Nonlinear Programming , 1995 .

[61]  D. Rogers,et al.  Variance-Reduction Techniques , 1988 .

[62]  Andrew G. Barto,et al.  Elevator Group Control Using Multiple Reinforcement Learning Agents , 1998, Machine Learning.

[63]  Shalabh Bhatnagar,et al.  Incremental Natural Actor-Critic Algorithms , 2007, NIPS.

[64]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[65]  M. Narasimha Murty,et al.  Information theoretic justification of Boltzmann selection and its generalization to Tsallis case , 2005, 2005 IEEE Congress on Evolutionary Computation.

[66]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[67]  Peter L. Bartlett,et al.  Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning , 2001, J. Mach. Learn. Res..

[68]  Ben Tse,et al.  Autonomous Inverted Helicopter Flight via Reinforcement Learning , 2004, ISER.

[69] Sean P. Meyn, Control Techniques for Complex Networks: Workload, 2007.

[70]  Richard S. Sutton,et al.  Temporal credit assignment in reinforcement learning , 1984 .

[71] Vijay R. Konda, et al. On Actor-Critic Algorithms, 2003, SIAM J. Control. Optim.

[72]  James W. Daniel,et al.  Splines and efficiency in dynamic programming , 1976 .

[73]  Thomas Hofmann,et al.  Natural Actor-Critic for Road Traffic Optimisation , 2007 .

[74]  Andrew G. Barto,et al.  Linear Least-Squares Algorithms for Temporal Difference Learning , 2005, Machine Learning.

[75] M. Kurano, Learning Algorithms for Markov Decision Processes, 1987.

[76]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[77]  Mahesan Niranjan,et al.  On-line Q-learning using connectionist systems , 1994 .

[78]  Justin A. Boyan,et al.  Least-Squares Temporal Difference Learning , 1999, ICML.

[79]  S. Thomas Alexander,et al.  Adaptive Signal Processing , 1986, Texts and Monographs in Computer Science.

[80]  John N. Tsitsiklis,et al.  Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[81]  John N. Tsitsiklis,et al.  Asynchronous stochastic approximation and Q-learning , 1993, Proceedings of 32nd IEEE Conference on Decision and Control.

[82]  R. Pemantle,et al.  Nonconvergence to Unstable Points in Urn Models and Stochastic Approximations , 1990 .

[83]  Richard S. Sutton,et al.  Neuronlike adaptive elements that can solve difficult learning control problems , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[84]  Richard S. Sutton,et al.  Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1995, NIPS.

[85]  Mohammad Ghavamzadeh,et al.  Bayesian actor-critic algorithms , 2007, ICML '07.

[86]  Mohammad Ghavamzadeh,et al.  Bayesian Policy Gradient Algorithms , 2006, NIPS.

[87] V. Borkar, Stochastic approximation with two time scales, 1997.

[88] Geoffrey J. Gordon, Stable Function Approximation in Dynamic Programming, 1995, ICML.

[89]  Stefan Schaal,et al.  Natural Actor-Critic , 2003, Neurocomputing.

[90]  J. Tsitsiklis,et al.  An optimal one-way multigrid algorithm for discrete-time stochastic control , 1991 .