Projections for Approximate Policy Iteration Algorithms

Approximate policy iteration is a class of reinforcement learning (RL) algorithms in which the policy is encoded using a function approximator; it has been especially prominent in RL with continuous action spaces. In this class of algorithms, ensuring that the policy return increases during a policy update often requires constraining the change in action distribution. Several approximations exist in the literature to solve this constrained policy update problem. In this paper, we propose to improve over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one, which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve both the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
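To make the idea of replacing a constrained update with an unconstrained one concrete, the following is a minimal sketch for a single-state problem with a discrete action distribution. It is an illustration under stated assumptions, not the projections proposed in the paper: the interpolation-based projection, the function names (`project_to_kl_ball`, `objective`, `numerical_grad`), the toy rewards, and the finite-difference gradient are all hypothetical choices made for brevity. The projection maps any candidate distribution to one satisfying KL(pi || q) <= eps with respect to the previous policy q, so plain gradient ascent on the unconstrained logits can be used.

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL(p || q), assuming q is strictly positive; terms with p == 0 contribute 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def project_to_kl_ball(p, q, eps, iters=50):
    # Hypothetical projection: interpolate p toward q until KL(. || q) <= eps.
    # KL(eta*p + (1-eta)*q || q) is convex in eta and zero at eta = 0,
    # hence nondecreasing on [0, 1], so bisection on eta is valid.
    if kl(p, q) <= eps:
        return p
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(mid * p + (1 - mid) * q, q) <= eps:
            lo = mid
        else:
            hi = mid
    return lo * p + (1 - lo) * q

def objective(logits, q, rewards, eps):
    # Expected reward of the projected policy; the constraint holds by construction.
    pi = project_to_kl_ball(softmax(logits), q, eps)
    return float(pi @ rewards)

def numerical_grad(f, x, h=1e-5):
    # Central finite differences, used here only to avoid an autodiff dependency.
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

q = np.array([0.25, 0.25, 0.25, 0.25])    # previous policy (assumed)
rewards = np.array([1.0, 0.0, 0.5, 0.2])  # toy per-action rewards (assumed)
eps = 0.05                                # KL bound (assumed)

logits = np.log(q)
for _ in range(200):
    grad = numerical_grad(lambda t: objective(t, q, rewards, eps), logits)
    logits = logits + 0.5 * grad          # unconstrained gradient ascent step

new_policy = project_to_kl_ball(softmax(logits), q, eps)
```

In this sketch the optimizer never sees the constraint explicitly; feasibility of `new_policy` follows from the projection alone, which is the property the abstract attributes to the projection-based approach.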
