Optimism-driven exploration for nonlinear systems

Tasks with unknown dynamics and costly system interaction time present a serious challenge for reinforcement learning. If a model of the dynamics can be learned quickly, interaction time can be reduced substantially. We show that combining an optimistic exploration strategy with model-predictive control achieves very good sample complexity for a range of nonlinear systems. Our method learns a Dirichlet process mixture of linear models using an exploration strategy based on optimism in the face of uncertainty. Trajectory optimization is used to plan paths in the learned model that both minimize the cost and perform exploration. Experimental results show that our approach achieves some of the most sample-efficient learning rates on several benchmark problems, and successfully learns to control a simulated helicopter during hover and autorotation with only seconds of interaction time. However, the method's computational requirements are substantial.
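The core model class in the abstract is a Dirichlet process mixture of linear models: transition data is softly partitioned into regions, and each region gets its own linear dynamics model. The sketch below illustrates that idea (not the paper's actual implementation) using scikit-learn's truncated Dirichlet-process mixture to cluster state-action pairs, then fitting a least-squares linear model per cluster; the toy 1-D system and all variable names are assumptions for illustration.

```python
# Hedged sketch: a Dirichlet-process mixture of local linear dynamics models,
# in the spirit of the model class described in the abstract.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Toy 1-D nonlinear system: s' = sin(s) + 0.5*a + noise
S = rng.uniform(-3, 3, size=(500, 1))
A = rng.uniform(-1, 1, size=(500, 1))
S_next = np.sin(S) + 0.5 * A + 0.01 * rng.standard_normal(S.shape)

X = np.hstack([S, A])  # (state, action) features for clustering and regression

# Truncated Dirichlet-process mixture over the state-action space
mix = BayesianGaussianMixture(
    n_components=10,  # truncation level of the DP
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=0,
).fit(X)
z = mix.predict(X)  # hard component assignments

# One linear model per occupied component: s' ~ [s, a, 1] @ W_k
models = {}
for k in np.unique(z):
    Phi = np.hstack([X[z == k], np.ones((np.sum(z == k), 1))])  # bias column
    W, *_ = np.linalg.lstsq(Phi, S_next[z == k], rcond=None)
    models[k] = W

def predict(s, a):
    """Predict the next state with the local linear model of the active component."""
    k = mix.predict(np.array([[s, a]]))[0]
    return float(np.array([s, a, 1.0]) @ models[k])

print(predict(1.0, 0.0))  # should approximate sin(1.0) for this toy system
```

The exploration side of the method (optimism in the face of uncertainty) would additionally track each component's parameter uncertainty and plan with optimistically chosen dynamics; that part is not shown here.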
