Collect & Infer - a fresh look at data-efficient Reinforcement Learning

This position paper proposes a fresh look at Reinforcement Learning (RL) from the perspective of data efficiency. Data-efficient RL has gone through three major stages: pure online RL, where every data point is considered only once; RL with a replay buffer, where additional learning is done on a portion of the experience; and finally transition-memory-based RL, where, conceptually, all transitions are stored and re-used in every update step. While inferring knowledge from all explicitly stored experience has led to a tremendous gain in data efficiency, the question of how this data is collected has been largely understudied. We argue that data efficiency can only be achieved through careful consideration of both aspects. We propose to make this insight explicit via a paradigm that we call 'Collect and Infer', which models RL as two separate but interconnected processes, concerned with data collection and knowledge inference respectively. We discuss the implications of the paradigm, how its ideas are reflected in the literature, and how it can guide future research on data-efficient RL.
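To make the two-process view concrete, the sketch below shows one possible instantiation of a Collect and Infer loop: a collect process gathers transitions into an ever-growing memory, and an infer process re-derives a Q-function from all stored transitions in every update, fitted-Q-iteration style. This is a minimal illustration under assumptions made for the example, not the paper's algorithm; the toy chain environment, the epsilon-greedy collector, and names such as `collect_episode` and `infer_q` are hypothetical.

```python
# Minimal, illustrative sketch of a Collect & Infer loop (not the paper's
# implementation): collection and inference are separate but interconnected
# processes that communicate only through the transition memory.
import random
from collections import defaultdict

N_STATES, ACTIONS, GOAL = 6, (-1, +1), 5  # tiny chain MDP with a goal state

def step(s, a):
    """Chain dynamics: reward 1 only when the goal state is reached."""
    s_next = max(0, min(N_STATES - 1, s + a))
    return s_next, float(s_next == GOAL), s_next == GOAL

def collect_episode(q, memory, epsilon=0.3, max_steps=20):
    """Collect process: behave (here epsilon-greedily) and store every transition."""
    s = 0
    for _ in range(max_steps):
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a_: q[(s, a_)])
        s_next, r, done = step(s, a)
        memory.append((s, a, r, s_next, done))  # the memory only ever grows
        s = s_next
        if done:
            break

def infer_q(memory, sweeps=50, gamma=0.95, lr=0.5):
    """Infer process: re-use ALL stored transitions in every update sweep."""
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in memory:
            target = r if done else r + gamma * max(q[(s_next, a_)] for a_ in ACTIONS)
            q[(s, a)] += lr * (target - q[(s, a)])
    return q

memory, q = [], defaultdict(float)
for iteration in range(10):          # alternate the two interconnected processes
    collect_episode(q, memory)       # data collection guided by current knowledge
    q = infer_q(memory)              # knowledge inference from the full memory
print(f"{len(memory)} stored transitions;",
      "greedy action in state 0:", max(ACTIONS, key=lambda a: q[(0, a)]))
```

In this sketch the collector happens to reuse the inferred Q-function for exploration, but the paradigm leaves open how collection is steered; any exploration strategy could drive `collect_episode` while inference operates on the shared memory.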
