Robust Constrained Reinforcement Learning for Continuous Control with Model Misspecification

Many real-world physical control systems must satisfy constraints upon deployment. Furthermore, real-world systems are often subject to effects such as non-stationarity, wear and tear, and uncalibrated sensors. These effects perturb the system dynamics and can cause a policy that was trained successfully in one domain to perform poorly when deployed to a perturbed version of the same domain. This can degrade a policy's ability to maximize future rewards as well as the extent to which it satisfies constraints. We refer to this as constrained model misspecification. We present an algorithm with theoretical guarantees that mitigates this form of misspecification, and we showcase its performance on multiple MuJoCo tasks from the Real World Reinforcement Learning (RWRL) suite.
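
As a rough formalization (a sketch in our own notation, not necessarily the paper's; the uncertainty set \(\mathcal{P}\), constraint cost \(c\), and threshold \(\beta\) are assumptions here), constrained model misspecification can be viewed as a robust constrained MDP: the policy must maximize its worst-case return over a set \(\mathcal{P}\) of plausible transition models while keeping the worst-case constraint cost below \(\beta\):

\[
\max_{\pi}\; \min_{p \in \mathcal{P}}\;
\mathbb{E}_{p,\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{subject to} \quad
\max_{p \in \mathcal{P}}\;
\mathbb{E}_{p,\pi}\!\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le \beta,
\]

where \(r\) is the per-step reward, \(c\) a per-step constraint cost, and \(\gamma\) the discount factor. A policy trained against a single nominal model \(p\) optimizes only one point of \(\mathcal{P}\), which is why perturbed dynamics can degrade both return and constraint satisfaction.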
