When should agents explore?

Exploration remains a central challenge for reinforcement learning (RL). Virtually all existing methods share the feature of a monolithic behaviour policy that changes only gradually (at best). In contrast, the exploratory behaviours of animals and humans exhibit a rich diversity, notably including forms of switching between modes. This paper presents an initial study of mode-switching, non-monolithic exploration for RL. We investigate which modes to switch between, at what timescales it makes sense to switch, and what signals make for good switching triggers. We also propose practical algorithmic components that make the switching mechanism adaptive and robust, enabling flexibility without an accompanying hyperparameter-tuning burden. Finally, we report a promising and detailed analysis on Atari, using two-mode exploration and switching at sub-episodic timescales.
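
To make the setting concrete, the sketch below shows one possible instance of a two-mode policy that switches at sub-episodic timescales, akin to a temporally-extended ε-greedy scheme. The blind probabilistic trigger, the fixed burst length, and all names here are illustrative assumptions for exposition, not the adaptive mechanism studied in the paper.

```python
import random

class TwoModeSwitcher:
    """Illustrative two-mode exploration wrapper (hypothetical sketch).

    Alternates between an 'exploit' mode (greedy with respect to the given
    value estimates) and an 'explore' mode (uniform-random actions) within an
    episode. The trigger here is blind: at each step there is a small chance
    of entering a fixed-length exploratory burst.
    """

    def __init__(self, num_actions, switch_prob=0.01, explore_steps=10):
        self.num_actions = num_actions
        self.switch_prob = switch_prob      # per-step chance of entering explore mode
        self.explore_steps = explore_steps  # duration of one exploratory burst
        self.steps_left_exploring = 0

    def act(self, q_values):
        # Blind trigger: occasionally start a temporally-extended exploratory burst.
        if self.steps_left_exploring == 0 and random.random() < self.switch_prob:
            self.steps_left_exploring = self.explore_steps

        if self.steps_left_exploring > 0:
            # Explore mode: act uniformly at random until the burst ends.
            self.steps_left_exploring -= 1
            return random.randrange(self.num_actions)

        # Exploit mode: act greedily with respect to the current value estimates.
        return max(range(self.num_actions), key=lambda a: q_values[a])
```

A switching trigger could instead be informed (e.g. driven by an uncertainty or intrinsic-reward signal) and the mode durations adapted online; the abstract's "adaptive and robust" components refer to choices of that kind rather than the fixed constants used above.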
