Simple Sensor Intentions for Exploration

Modern reinforcement learning algorithms can learn solutions to increasingly difficult control problems while, at the same time, reducing the amount of prior knowledge needed for their application. One of the remaining challenges is the definition of reward schemes that appropriately facilitate exploration without biasing the solution in undesirable ways, and that can be implemented on real robotic systems without expensive instrumentation. In this paper we focus on a setting in which goal tasks are defined via simple sparse rewards, and exploration is facilitated via agent-internal auxiliary tasks. We introduce the idea of simple sensor intentions (SSIs) as a generic way to define auxiliary tasks. SSIs reduce the amount of prior knowledge required to define suitable rewards. Furthermore, they can be computed directly from raw sensor streams and thus do not require expensive and possibly brittle state estimation on real systems. We demonstrate that a learning system based on these rewards can solve complex robotic tasks in simulation and in real-world settings. In particular, we show that a real robotic arm can learn to grasp and lift, and can solve a Ball-in-a-Cup task, from scratch, when only raw sensor streams are used both as controller input and in the auxiliary reward definitions.
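The abstract does not spell out what a concrete simple sensor intention looks like. As a rough, hypothetical illustration of an auxiliary reward computed directly from a raw sensor stream, the sketch below rewards the agent for moving a simple statistic of the camera image (the centroid of a bright color blob) in a fixed direction; the function names, thresholds, and the particular statistic are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def color_blob_centroid(image, channel=0, threshold=0.6):
    """Mean pixel location (x, y) of bright pixels in one color channel.

    `image` is an HxWx3 float array in [0, 1]; the centroid is a crude,
    calibration-free summary of the raw camera stream.
    """
    mask = image[..., channel] > threshold
    if not mask.any():
        return None  # statistic undefined when no pixel matches
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def ssi_reward(prev_image, image, direction=np.array([0.0, -1.0])):
    """Hypothetical SSI-style auxiliary reward: positive when the blob
    statistic moves in the chosen image direction (here: upwards)."""
    prev_c = color_blob_centroid(prev_image)
    cur_c = color_blob_centroid(image)
    if prev_c is None or cur_c is None:
        return 0.0
    return float(np.dot(cur_c - prev_c, direction))
```

In a full system one would presumably define several such statistics, each with "increase" and "decrease" variants, and learn them as auxiliary intentions alongside the sparse main-task reward.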
