Robust Value Iteration for Continuous Control Tasks

When transferring a control policy from simulation to a physical system, the policy needs to be robust to variations in the dynamics to perform well. Commonly, the optimal policy overfits to the approximate model and the corresponding state distribution, often resulting in failure to transfer under distributional shifts. In this paper, we present Robust Fitted Value Iteration, which uses dynamic programming to compute the optimal value function on the compact state domain and incorporates adversarial perturbations of the system dynamics. The adversarial perturbations encourage an optimal policy that is robust to changes in the dynamics. Utilizing the continuous-time perspective of reinforcement learning, we derive the optimal perturbations for the states, actions, observations and model parameters in closed form. Notably, the resulting algorithm does not require discretization of states or actions. Therefore, the optimal adversarial perturbations can be efficiently incorporated into the min-max value function update. We apply the resulting algorithm to the physical Furuta pendulum and cartpole. By changing the masses of the systems, we evaluate the quantitative and qualitative performance across different model parameters. We show that robust value iteration is more robust than deep reinforcement learning algorithms and the non-robust variant of the algorithm. Videos of the experiments are available at https://sites.google.com/view/rfvi
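The core of the approach is a min-max value update in which an adversary perturbs the dynamics before the value target is computed. The sketch below illustrates this idea on a hypothetical discrete-time surrogate with a state perturbation bounded in an ε-ball; the names (`f`, `reward`, `V`, `grad_V`, `eps`) and the first-order worst-case perturbation are illustrative assumptions, not the authors' closed-form continuous-time derivation.

```python
import numpy as np

def robust_value_target(x, actions, V, grad_V, f, reward, eps, gamma=0.99):
    """Minimal sketch of a robust (min-max) value target for state x.

    Assumptions (not the paper's implementation): discrete-time dynamics f,
    a candidate action set, and an adversary that shifts the next state
    within an eps-ball. To first order, the worst-case perturbation is a
    step of length eps against the value-function gradient.
    """
    best = -np.inf
    for u in actions:                        # maximize over the agent's actions
        x_next = f(x, u)                     # nominal next state
        g = grad_V(x_next)                   # value-function gradient at x_next
        xi = -eps * g / (np.linalg.norm(g) + 1e-12)  # adversarial state shift
        target = reward(x, u) + gamma * V(x_next + xi)
        best = max(best, target)
    return best
```

Because the perturbation enters in closed form, the inner minimization does not require a separate optimization loop or discretization of the state and action spaces; the adversarial step is evaluated directly inside the value update.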
