HJB Optimal Feedback Control with Deep Differential Value Functions and Action Constraints

Learning optimal feedback control laws capable of executing optimal trajectories is essential for many robotic applications. Such policies can be learned using reinforcement learning or planned using optimal control. While reinforcement learning is sample-inefficient, optimal control only plans an optimal trajectory from a single initial configuration. In this paper we propose deep optimal feedback control, which learns an optimal feedback policy rather than a single trajectory. By exploiting the inherent structure of the robot dynamics and a strictly convex action cost, we derive principled cost functions such that, given the optimal value function, the optimal policy naturally obeys the action limits and is globally optimal and stable on the training domain. The corresponding optimal value function is learned end-to-end by embedding a deep differential network in the Hamilton-Jacobi-Bellman (HJB) differential equation and minimizing the residual of this equality, while simultaneously reducing the discounting from short- to far-sighted to make learning feasible. The proposed approach yields an optimal feedback control law in continuous time that, in contrast to existing approaches, generates an optimal trajectory from any point in the state space without replanning. We evaluate the approach on non-linear systems and achieve optimal feedback control where standard optimal control methods require frequent replanning.
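The core idea of minimizing the HJB residual over a parameterized value function can be illustrated on a toy problem. The sketch below is my own minimal construction, not the paper's implementation: it uses a scalar linear system with quadratic costs (constants `a`, `b`, `q`, `r` are assumed for illustration) and a one-parameter quadratic value-function ansatz in place of a deep differential network, so the learned parameter can be checked against the known Riccati solution. The strictly convex action cost makes the minimizing control available in closed form, exactly the structural property the paper exploits.

```python
import numpy as np

# Hypothetical 1D example (constants chosen for illustration, not from the paper):
# scalar control-affine system  x_dot = a*x + b*u  with running cost  q*x^2 + r*u^2.
a, b, q, r = 0.0, 1.0, 1.0, 1.0

# Value-function ansatz V(x) = p * x^2, standing in for a deep network; dV/dx = 2*p*x.
def hjb_residual(p, x):
    v_x = 2.0 * p * x
    # Strict convexity of the action cost gives the HJB-minimizing control in
    # closed form: u*(x) = -b * dV/dx / (2*r).
    u = -b * v_x / (2.0 * r)
    # Undiscounted infinite-horizon HJB: 0 = min_u [ cost + dV/dx * dynamics ].
    return q * x**2 + r * u**2 + v_x * (a * x + b * u)

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=256)  # samples covering the training domain

def loss(p):
    return np.mean(hjb_residual(p, xs) ** 2)

# Minimize the squared HJB residual by gradient descent on the single parameter p
# (finite-difference gradient, since there is only one parameter).
p, lr, eps = 0.1, 0.5, 1e-5
for _ in range(300):
    g = (loss(p + eps) - loss(p - eps)) / (2.0 * eps)
    p -= lr * g

# For these constants the continuous-time Riccati solution is p = sqrt(q*r)/b = 1.
print(round(p, 3))
```

With a neural network in place of the quadratic ansatz, the same residual loss is minimized over sampled states; the paper's curriculum on the discount factor (short- to far-sighted) would additionally anneal a discount term in the residual, which this sketch omits.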
