Approximate real-time optimal control based on sparse Gaussian process models

In this paper we present a fully automated approach to (approximate) optimal control of non-linear systems. Our algorithm jointly learns a non-parametric model of the system dynamics, based on Gaussian Process Regression (GPR), and performs receding-horizon control using an adapted iterative LQR formulation. This results in an extremely data-efficient learning algorithm that can operate under real-time constraints. When combined with an exploration strategy based on the GPR predictive variance, our algorithm successfully learns to control two benchmark problems in simulation (two-link manipulator, cart-pole) as well as to swing up and balance a real cart-pole system. For all problems considered, learning from scratch, i.e. without any prior knowledge provided by an expert, succeeds in fewer than 10 episodes of interaction with the system.
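To make the overall control loop concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of the general idea: a Gaussian process dynamics model learned from interaction data, used inside a receding-horizon loop, with a variance-based bonus encouraging exploration. For brevity, the adapted iLQR planner described in the abstract is replaced here by a naive constant-action candidate search on a 1-D toy system; the dynamics, cost, horizon, and all constants are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout): GP dynamics model + receding-horizon control
# with a variance-based exploration bonus. Not the paper's implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def true_dynamics(x, u):
    # Unknown plant (illustrative 1-D nonlinear system); only used to generate data
    # and to simulate the closed loop below.
    return x + 0.1 * (-np.sin(x) + u)

# Collect a small batch of interaction data: (state, action) -> next state.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(50, 2))          # columns: state, action
y = np.array([true_dynamics(s, a) for s, a in X])

# Non-parametric dynamics model (full GP here; the paper uses a sparse approximation).
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)

def predict(x, u):
    # Predictive mean and standard deviation of the next state.
    mean, std = gp.predict(np.array([[x, u]]), return_std=True)
    return mean[0], std[0]

def plan(x0, horizon=5, candidates=np.linspace(-1.0, 1.0, 21), beta=0.1):
    # Receding-horizon planning: evaluate a few constant-action rollouts under the GP
    # model and pick the one with the lowest cost minus an uncertainty (exploration) bonus.
    # (The paper uses an adapted iterative LQR here instead of this naive search.)
    best_u, best_cost = 0.0, np.inf
    for u in candidates:
        x, cost = x0, 0.0
        for _ in range(horizon):
            x, std = predict(x, u)
            cost += x**2 + 0.01 * u**2 - beta * std   # quadratic cost, variance bonus
        if cost < best_cost:
            best_u, best_cost = u, cost
    return best_u

# Closed loop: re-plan at every step, apply only the first action of the plan.
x = 1.5
for t in range(20):
    u = plan(x)
    x = true_dynamics(x, u)
print("final state:", x)
```

In the setting described above, the model would be a sparse GP over multi-dimensional state-action pairs and the planner an iterative LQR re-solved at every control step under real-time constraints; the structure of the loop, however, is the same: re-plan from the current state, apply the first action, and add the newly observed transition to the training data.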
