Model Predictive Actor-Critic: Accelerating Robot Skill Acquisition with Deep Reinforcement Learning

Substantial advancements in model-based reinforcement learning have been impeded by the model bias induced by the collected data, which generally hurts performance. Meanwhile, the inherent sample efficiency of model-based methods makes them well suited to most robot applications, limiting potential damage to the robot and its environment during training. Inspired by information theoretic model predictive control and advances in deep reinforcement learning, we introduce Model Predictive Actor-Critic (MoPAC), a hybrid model-based/model-free method that combines model predictive rollouts with policy optimization to mitigate model bias. MoPAC leverages optimal trajectories to guide policy learning, but explores via its model-free method, allowing the algorithm to learn more expressive dynamics models. This combination guarantees optimal skill learning up to an approximation error and reduces the necessary physical interaction with the environment, making it suitable for real-robot training. We provide extensive results showcasing how our proposed method generally outperforms the current state of the art, and conclude by evaluating MoPAC for learning on a physical robotic hand performing valve rotation and finger gaiting, a task that requires grasping, manipulation, and then regrasping of an object.
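
To make the hybrid model-based/model-free structure concrete, below is a minimal, hypothetical sketch, not the authors' implementation: a toy environment, a linear dynamics model fit on real transitions, a random-shooting model-predictive planner, and a trivial noise-exploring policy that is regressed toward the planner's actions. Every component and name here (the environment, the model class, the planner, the scalar "policy") is an illustrative assumption.

```python
# Minimal sketch of a hybrid loop: real rollouts refit a learned model,
# model-predictive rollouts supply guidance actions for the policy.
# All components are hypothetical stand-ins, not the MoPAC implementation.
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACT_DIM, HORIZON, N_CANDIDATES = 3, 1, 5, 64


def env_step(s, a):
    """Toy point-mass environment: actions push the state toward the origin."""
    s_next = s + 0.1 * np.tanh(a) * np.ones_like(s) + 0.01 * rng.normal(size=s.shape)
    reward = -float(np.sum(s_next ** 2))
    return s_next, reward


class LinearDynamicsModel:
    """Learned dynamics model fit by least squares on real transitions."""

    def __init__(self):
        self.W = np.zeros((STATE_DIM + ACT_DIM, STATE_DIM))

    def fit(self, states, actions, next_states):
        X = np.hstack([states, actions])
        self.W, *_ = np.linalg.lstsq(X, next_states, rcond=None)

    def predict(self, s, a):
        return np.concatenate([s, a]) @ self.W


def mpc_action(model, s):
    """Random-shooting MPC: return the first action of the best imagined rollout."""
    best_a, best_ret = None, -np.inf
    for _ in range(N_CANDIDATES):
        actions = rng.uniform(-1, 1, size=(HORIZON, ACT_DIM))
        s_sim, ret = s.copy(), 0.0
        for a in actions:
            s_sim = model.predict(s_sim, a)
            ret -= float(np.sum(s_sim ** 2))  # imagined reward under the model
        if ret > best_ret:
            best_a, best_ret = actions[0], ret
    return best_a


# Outer loop: act with an exploratory policy, periodically refit the model,
# then nudge the (here scalar) policy toward the model-predictive action.
buffer = []
policy_gain = 0.0  # stand-in for actor parameters
model = LinearDynamicsModel()
s = rng.normal(size=STATE_DIM)
for step in range(200):
    a = np.clip(policy_gain * s[:ACT_DIM] + 0.3 * rng.normal(size=ACT_DIM), -1, 1)
    s_next, r = env_step(s, a)
    buffer.append((s, a, s_next))
    if len(buffer) >= 20 and step % 20 == 0:
        S, A, S2 = map(np.array, zip(*buffer))
        model.fit(S, A, S2)
        a_mpc = mpc_action(model, s_next)
        # Least-squares step regressing the policy toward the MPC action.
        policy_gain += 0.1 * float((a_mpc[0] - policy_gain * s_next[0]) * s_next[0])
    s = s_next
```

In a full actor-critic instantiation, the scalar `policy_gain` regression would be replaced by off-policy actor and critic updates on a mixture of real and model-generated transitions; this sketch only illustrates the interleaving of real rollouts, model fitting, and model-predictive guidance.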
