Continuous-Time Fitted Value Iteration for Robust Policies

Solving the Hamilton-Jacobi-Bellman equation is important in many domains, including control, robotics, and economics. Especially for continuous control, solving this differential equation and its extension, the Hamilton-Jacobi-Isaacs equation, is important because it yields the optimal policy that achieves the maximum reward on a given task. In the case of the Hamilton-Jacobi-Isaacs equation, which includes an adversary controlling the environment and minimizing the reward, the obtained policy is also robust to perturbations of the dynamics. In this paper we propose continuous fitted value iteration (cFVI) and robust fitted value iteration (rFVI). These algorithms leverage the non-linear control-affine dynamics and separable state and action reward of many continuous control problems to derive the optimal policy and optimal adversary in closed form. This analytic expression simplifies the differential equations and enables us to solve for the optimal value function using value iteration for continuous states and actions, as well as for the adversarial case. Notably, the resulting algorithms do not require discretization of states or actions. We apply the resulting algorithms to the Furuta pendulum and cartpole and show that both obtain the optimal policy. The Sim2Real robustness experiments on the physical systems show that the policies successfully achieve the task in the real world. When changing the masses of the pendulum, we observe that robust fitted value iteration is more robust than deep reinforcement learning algorithms and the non-robust variant of our algorithm. Videos of the experiments are available at https://sites.google.com/view/rfvi
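As a rough sketch of the closed-form step the abstract alludes to (the notation below is assumed for illustration and is not taken verbatim from the paper): for control-affine dynamics and a reward that separates into a state reward and a strictly convex action cost, the maximization inside the continuous-time HJB equation can be carried out analytically.

% Hedged sketch; a(x), B(x), q_c, g_c and the convex conjugate \tilde{g}_c are assumed notation.
\[
  \rho\, V^{*}(x) \;=\; \max_{u}\Big[\, q_c(x) \,-\, g_c(u) \,+\, \nabla_x V^{*}(x)^{\top}\big(a(x) + B(x)\,u\big) \Big],
  \qquad
  u^{*}(x) \;=\; \nabla \tilde{g}_c\big(B(x)^{\top}\,\nabla_x V^{*}(x)\big),
\]

where \(\tilde{g}_c\) denotes the convex conjugate of the action cost \(g_c\). Because the dynamics are affine in \(u\) and \(g_c\) is strictly convex, the maximizing action is unique and available in closed form, which is what removes the need to discretize the action space; an analogous argument yields the closed-form adversary in the Hamilton-Jacobi-Isaacs case.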
