Reward Constrained Policy Optimization

Solving tasks in Reinforcement Learning is no easy feat. As the goal of the agent is to maximize the accumulated reward, it often learns to exploit loopholes and misspecifications in the reward signal, resulting in unwanted behavior. While constraints may solve this issue, there is no closed-form solution for general constraints. In this work we present a novel multi-timescale approach for constrained policy optimization, called 'Reward Constrained Policy Optimization' (RCPO), which uses an alternative penalty signal to guide the policy towards a constraint-satisfying one. We prove the convergence of our approach and provide empirical evidence of its ability to train constraint-satisfying policies.
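The paper's own formulation is more general, but the core mechanism can be illustrated with a small sketch. The code below is a toy example of our own devising, not taken from the paper: it applies the RCPO-style recipe to a hypothetical 3-armed bandit, where the policy follows a REINFORCE gradient on the penalized reward r - λ·c while the penalty coefficient λ is updated by gradient ascent on the constraint violation at a slower timescale and projected onto [0, ∞). The arm rewards, costs, threshold alpha, and learning rates are all made-up illustrative values.

```python
import numpy as np

# Hypothetical toy problem (not from the paper): a 3-armed bandit where
# each arm has an immediate reward and an immediate cost. Goal: maximize
# expected reward subject to E[cost] <= alpha.
rewards = np.array([1.0, 0.8, 0.3])
costs   = np.array([1.0, 0.4, 0.1])
alpha   = 0.5                      # constraint threshold on expected cost

theta = np.zeros(3)                # softmax policy parameters
lam   = 0.0                        # Lagrange multiplier (penalty coefficient)
lr_theta, lr_lam = 0.05, 0.005     # policy on a faster timescale than lambda

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(20000):
    pi = softmax(theta)
    a = rng.choice(3, p=pi)

    # Penalized reward r_hat = r - lam * c is the alternative penalty
    # signal guiding the policy; the true reward is left unchanged.
    r_hat = rewards[a] - lam * costs[a]

    # REINFORCE-style gradient step on the penalized objective:
    # grad_theta log pi(a) = e_a - pi for a softmax policy.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta += lr_theta * r_hat * grad_log_pi

    # Slower-timescale ascent on lambda: raise the penalty while the
    # constraint is violated, lower it otherwise; project onto [0, inf).
    lam = max(0.0, lam + lr_lam * (costs[a] - alpha))

pi = softmax(theta)
print("policy:", pi.round(3), "E[cost]:", float(pi @ costs))
```

The timescale separation (lr_lam much smaller than lr_theta) lets the policy track a near-stationary penalty signal, which is the kind of multi-timescale stochastic approximation the paper's convergence argument relies on.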
