Value Iteration in Continuous Actions, States and Time

Classical value iteration is not directly applicable to environments with continuous states and actions. For such environments, states and actions are usually discretized, which leads to an exponential increase in computational complexity. In this paper, we propose continuous fitted value iteration (cFVI), an algorithm that enables dynamic programming for continuous states and actions given a known dynamics model. Leveraging the continuous-time formulation, the optimal policy can be derived in closed form for non-linear control-affine dynamics. This closed-form solution enables the efficient extension of value iteration to continuous environments. In non-linear control experiments, we show that the dynamic-programming solution matches the quantitative performance of deep reinforcement learning methods in simulation but excels when transferred to the physical system. The policy obtained by cFVI is more robust to changes in the dynamics, despite using only a deterministic model and without explicitly incorporating robustness into the optimization. Videos of the physical system are available at https://sites.google.com/view/value-iteration.
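
The closed-form policy mentioned above can be sketched with the standard Hamilton-Jacobi-Bellman argument; the symbols below are illustrative and assume control-affine dynamics and a reward that separates into a state reward minus a strictly convex action cost, which is the setting the abstract describes. With dynamics $\dot{x} = a(x) + B(x)\,u$ and reward $r(x, u) = q_c(x) - g_c(u)$, the per-state maximization inside the HJB equation,

$u^{*}(x) = \arg\max_{u} \left[ q_c(x) - g_c(u) + \nabla_x V(x)^{\top} \big( a(x) + B(x)\,u \big) \right],$

has the stationarity condition $\nabla g_c(u^{*}) = B(x)^{\top} \nabla_x V(x)$, i.e. $u^{*}(x) = \nabla \tilde{g}_c\big( B(x)^{\top} \nabla_x V(x) \big)$, where $\tilde{g}_c$ denotes the convex conjugate of $g_c$. For a quadratic action cost $g_c(u) = \tfrac{1}{2} u^{\top} R u$ this reduces to $u^{*}(x) = R^{-1} B(x)^{\top} \nabla_x V(x)$, so each value-iteration sweep only requires the gradient of the current value-function approximation rather than a numerical optimization over actions.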
