Action Robust Reinforcement Learning and Applications in Continuous Control

A policy is said to be robust if it maximizes the reward while considering a bad, or even adversarial, model. In this work we formalize two new criteria of robustness to action uncertainty. Specifically, we consider two scenarios in which the agent attempts to perform an action $a$, and (i) with probability $\alpha$, an alternative adversarial action $\bar a$ is taken instead, or (ii) in the case of a continuous action space, an adversary adds a perturbation to the selected action. We show that our criteria are related to common forms of uncertainty in robotics domains, such as the occurrence of abrupt forces, and suggest algorithms for the tabular case. Building on the suggested algorithms, we generalize our approach to deep reinforcement learning (DRL) and provide extensive experiments on various MuJoCo domains. Our experiments show that not only does our approach produce robust policies, but it also improves performance in the absence of perturbations. This generalization indicates that action robustness can be thought of as implicit regularization in RL problems.
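To make the two criteria concrete, the sketch below shows one way the perturbed actions could be generated at environment level. This is an illustrative sketch only, not the paper's implementation: the wrapper name `ActionRobustWrapper`, the `mode` flag, and the `_adversarial_action` helper are hypothetical, and the adversary here is a uniformly random placeholder rather than the learned adversarial policy the algorithms would use.

```python
import numpy as np
import gym


class ActionRobustWrapper(gym.ActionWrapper):
    """Illustrative action-perturbation wrapper (a sketch, not the paper's code).

    mode="prob":  with probability alpha, the agent's action is replaced by an
                  adversarial action (criterion (i) above).
    mode="noise": an adversarial perturbation scaled by alpha is added to the
                  agent's action (criterion (ii), continuous action spaces only).
    """

    def __init__(self, env, alpha=0.1, mode="prob", rng=None):
        super().__init__(env)
        assert mode in ("prob", "noise")
        self.alpha = alpha
        self.mode = mode
        self.rng = rng if rng is not None else np.random.default_rng()

    def _adversarial_action(self):
        # Placeholder adversary: a uniformly random action from the action space.
        # In the actual algorithms this would be a learned adversarial policy.
        return self.action_space.sample()

    def action(self, act):
        if self.mode == "prob":
            # Criterion (i): the executed action is the agent's choice with
            # probability 1 - alpha and the adversary's with probability alpha.
            if self.rng.random() < self.alpha:
                return self._adversarial_action()
            return act
        # Criterion (ii): additive perturbation, assuming a Box action space,
        # clipped back into the valid action bounds.
        perturbed = act + self.alpha * self._adversarial_action()
        return np.clip(perturbed, self.action_space.low, self.action_space.high)
```

Wrapping a MuJoCo task this way (e.g., `ActionRobustWrapper(gym.make("Hopper-v2"), alpha=0.1, mode="noise")`) and training a standard continuous-control agent inside it simulates the abrupt-force disturbances mentioned above; the proposed algorithms go further and optimize the agent and the adversary jointly rather than relying on a random adversary.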
