Learning from Demonstrations for Real World Reinforcement Learning

Deep reinforcement learning (RL) has achieved several high-profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance; in fact, their performance during learning can be extremely poor. This may be acceptable in a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting in which the agent has access to data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small amounts of demonstration data to massively accelerate learning, and that automatically adjusts the ratio of demonstration data used while learning via a prioritized replay mechanism. DQfD works by combining temporal-difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN): it starts with better scores over the first million steps on 41 of 42 games, and on average PDD DQN takes 82 million steps to catch up to DQfD's performance. DQfD learns to outperform the best demonstration it was given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results on 17 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
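
To make the combination described above concrete, here is a minimal sketch, not the authors' released implementation, of how a DQfD-style loss can couple a one-step temporal-difference term with a large-margin supervised term on the demonstrator's actions. The network names (q_net, target_net), the margin value, the weight lambda_supervised, and the use of PyTorch are illustrative assumptions; the full algorithm additionally mixes demonstration and self-generated transitions through prioritized replay, which is omitted here.

# Minimal DQfD-style loss sketch (illustrative; not the paper's exact formulation).
import torch
import torch.nn.functional as F

def dqfd_loss(q_net, target_net, states, actions, rewards, next_states, dones,
              is_demo, gamma=0.99, margin=0.8, lambda_supervised=1.0):
    """Combine a 1-step TD loss with a large-margin supervised loss on demo transitions."""
    q_values = q_net(states)                                   # (batch, n_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # 1-step double-DQN temporal-difference target.
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        td_target = rewards + gamma * (1.0 - dones) * next_q
    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Large-margin supervised term: the demonstrated action should score at least
    # `margin` higher than any other action; applied only where is_demo == 1.
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, actions.unsqueeze(1), 0.0)             # no margin for the demo action
    supervised = (q_values + margins).max(dim=1).values - q_taken
    supervised_loss = (supervised * is_demo).mean()

    return td_loss + lambda_supervised * supervised_loss

On self-generated transitions (is_demo == 0) the supervised term vanishes, so the update reduces to an ordinary temporal-difference update; on demonstration transitions both terms push the Q-function toward imitating the demonstrator while remaining consistent with the Bellman equation.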
