Combining Model-Based and Model-Free Updates for Deep Reinforcement Learning

The ability to learn motor skills autonomously is one of the main requirements for deploying robots in unstructured real-world environments. The goal of reinforcement learning (RL) is to learn such skills through trial and error, thus avoiding tedious manual engineering. However, real-world applications of RL have to contend with two often opposing requirements: data-efficient learning and the ability to handle complex, unknown dynamical systems that might be difficult to model explicitly. Real-world physical systems, such as robots, are typically costly and time-consuming to run, making it highly desirable to learn using the lowest possible number of real-world trials. Model-based methods tend to excel at this [5], but suffer from significant bias, since complex unknown dynamics cannot always be modeled accurately enough to produce effective policies. Model-free methods have the advantage of handling arbitrary dynamical systems with minimal bias, but tend to be substantially less sample-efficient [9, 17]. Can we combine the efficiency of model-based algorithms with the final performance of model-free algorithms in a method that we can practically use on real-world physical systems?

Many prior methods that combine model-free and model-based techniques achieve only modest gains in efficiency or performance [6, 7]. In this work, we develop a method in the context of a specific policy representation: time-varying linear-Gaussian controllers. The structure of these policies provides us with an effective option for model-based updates via iterative linear-Gaussian dynamics fitting [10], as well as a simple option for model-free updates via the path integral policy improvement (PI²) algorithm [19]. Although time-varying linear-Gaussian (TVLG) policies are not as powerful as representations such as deep neural networks [13, 14] or RBF networks [4], they can represent arbitrary trajectories in continuous state-action spaces. Furthermore, prior work on guided policy search (GPS) has shown that TVLG policies can be used to train general-purpose parameterized policies, including deep neural network policies, for tasks involving complex sensory inputs such as vision [10, 12]. This yields a general-purpose RL procedure with favorable stability and sample complexity compared to fully model-free deep RL methods [16].

The main contribution of this paper is a procedure for optimizing TVLG policies that integrates both fast model-based updates via iteratively fitted linear-Gaussian dynamics and sample-based model-free updates via PI².
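To make the two update routes concrete, the following sketch writes out the policy class and the two update rules; the symbols ($\mathbf{K}_t$, $\mathbf{k}_t$, $\mathbf{C}_t$, $f_{\mathbf{x},t}$, $f_{\mathbf{u},t}$, $\mathbf{F}_t$, $\eta$) follow notation common in the cited prior work rather than being taken verbatim from this paper, so treat them as illustrative. A TVLG controller and the fitted linear-Gaussian dynamics have the form

\[
p(\mathbf{u}_t \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{K}_t \mathbf{x}_t + \mathbf{k}_t,\; \mathbf{C}_t\big), \qquad t = 1, \dots, T,
\]
\[
p(\mathbf{x}_{t+1} \mid \mathbf{x}_t, \mathbf{u}_t) = \mathcal{N}\big(f_{\mathbf{x},t}\mathbf{x}_t + f_{\mathbf{u},t}\mathbf{u}_t + f_{c,t},\; \mathbf{F}_t\big).
\]

The model-based route refits these linear-Gaussian dynamics to the latest on-policy samples and improves $(\mathbf{K}_t, \mathbf{k}_t, \mathbf{C}_t)$ with an LQR-style backward pass, whereas a PI²-style model-free update instead reweights sampled trajectories by their exponentiated negative cost-to-go,

\[
w_{i,t} \propto \exp\!\Big(-\tfrac{1}{\eta} S_{i,t}\Big), \qquad S_{i,t} = \sum_{t'=t}^{T} c(\mathbf{x}_{i,t'}, \mathbf{u}_{i,t'}),
\]

and refits the controller to the weighted samples, requiring no dynamics model at all.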

[1] Jan Peters, et al. Reinforcement learning in robotics: A survey, 2013, Int. J. Robotics Res.

[2] Sergey Levine, et al. Path integral guided policy search, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[3] Sergey Levine, et al. End-to-End Training of Deep Visuomotor Policies, 2015, J. Mach. Learn. Res.

[4] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.

[5] Yuval Tassa, et al. Continuous control with deep reinforcement learning, 2015, ICLR.

[6] Sergey Levine, et al. Continuous Deep Q-Learning with Model-based Acceleration, 2016, ICML.

[7] Yuval Tassa, et al. Synthesis and stabilization of complex behaviors through online trajectory optimization, 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems.

[8] Sergey Levine, et al. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics, 2014, NIPS.

[9] Stefan Schaal, et al. A Generalized Path Integral Control Approach to Reinforcement Learning, 2010, J. Mach. Learn. Res.

[10] Jan Peters, et al. A Survey on Policy Search for Robotics, 2013, Found. Trends Robotics.

[11] Yuval Tassa, et al. Learning Continuous Control Policies by Stochastic Value Gradients, 2015, NIPS.

[12] Alex Graves, et al. Playing Atari with Deep Reinforcement Learning, 2013, arXiv.

[13] Nolan Wagener, et al. Learning contact-rich manipulation skills with guided policy search, 2015 IEEE International Conference on Robotics and Automation (ICRA).

[14] Oliver Kroemer, et al. Learning sequential motor tasks, 2013 IEEE International Conference on Robotics and Automation.

[15] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.

[16] Carl E. Rasmussen, et al. Learning to Control a Low-Cost Manipulator using Data-Efficient Reinforcement Learning, 2011, Robotics: Science and Systems.

[17] Sergey Levine, et al. Reset-free guided policy search: Efficient deep reinforcement learning with stochastic initial states, 2017 IEEE International Conference on Robotics and Automation (ICRA).

[18] Sergey Levine, et al. Guided Policy Search via Approximate Mirror Descent, 2016, NIPS.