Lyapunov Design for Safe Reinforcement Learning

Lyapunov design methods are used widely in control engineering to design controllers that achieve qualitative objectives, such as stabilizing a system or maintaining a system's state in a desired operating range. We propose a method for constructing safe, reliable reinforcement learning agents based on Lyapunov design principles. In our approach, an agent learns to control a system by switching among a number of given, base-level controllers. These controllers are designed using Lyapunov domain knowledge so that any switching policy is safe and enjoys basic performance guarantees. Our approach thus ensures qualitatively satisfactory agent behavior for virtually any reinforcement learning algorithm and at all times, including while the agent is learning and taking exploratory actions. We demonstrate the process of designing safe agents for four different control problems. In simulation experiments, we find that our theoretically motivated designs also enjoy a number of practical benefits, including reasonable performance from the very start of learning and throughout it, as well as accelerated learning.

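The abstract describes the core architectural idea: the learner chooses among base-level controllers rather than raw actions, so that Lyapunov-based guarantees built into the controllers hold under any switching policy, even during exploration. The sketch below is a minimal illustration of such a switching layer, not the authors' implementation; the controller list, the hashable (e.g., discretized) state representation, and the choice of one-step Q-learning as the learning algorithm are assumptions introduced here for concreteness, since the approach is meant to work with virtually any reinforcement learning algorithm.

# Minimal sketch (assumed details, not the paper's code): a Q-learning agent
# that switches among hypothetical base-level controllers. Because every
# available action is a controller assumed to decrease a shared Lyapunov
# function, any switching policy -- including random exploration -- inherits
# that descent property, which is the basis of the safety argument.
import random
from collections import defaultdict
from typing import Callable, Sequence

State = tuple          # placeholder: any hashable (e.g., discretized) state
Controller = Callable  # a base-level controller mapping state -> command

class LyapunovSwitchingAgent:
    def __init__(self, controllers: Sequence[Controller],
                 epsilon: float = 0.1, alpha: float = 0.1, gamma: float = 0.99):
        self.controllers = controllers      # Lyapunov-verified base controllers
        self.epsilon = epsilon              # exploration rate
        self.alpha = alpha                  # learning rate
        self.gamma = gamma                  # discount factor
        self.q = defaultdict(float)         # Q-values keyed by (state, controller index)

    def act(self, state: State) -> int:
        # Epsilon-greedy choice of a controller index. Exploration remains
        # safe because every option is a verified controller, not an
        # arbitrary low-level action.
        if random.random() < self.epsilon:
            return random.randrange(len(self.controllers))
        return max(range(len(self.controllers)),
                   key=lambda a: self.q[(state, a)])

    def update(self, s: State, a: int, r: float, s_next: State) -> None:
        # Standard one-step Q-learning update over the switching-level MDP.
        best_next = max(self.q[(s_next, b)] for b in range(len(self.controllers)))
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

The design choice this illustrates is that safety is established offline, by Lyapunov analysis of each base controller; the learning layer above is then free to explore, because the guarantee does not depend on which controller is selected at any given time.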