Reactive Exploration to Cope with Non-Stationarity in Lifelong Reinforcement Learning

In lifelong learning, an agent learns throughout its entire life without resets, in a constantly changing environment, as we humans do. Consequently, lifelong learning comes with a host of research problems such as continual domain shifts, which result in non-stationary rewards and environment dynamics. These non-stationarities are difficult to detect and cope with because they are continuous rather than abrupt. Exploration strategies and learning methods are therefore required that can track these steady domain shifts and adapt to them. We propose Reactive Exploration to track and react to continual domain shifts in lifelong reinforcement learning and to update the policy accordingly. To this end, we conduct experiments investigating different exploration strategies. We empirically show that representatives of the policy-gradient family are better suited for lifelong learning than Q-learning, as they adapt more quickly to distribution shifts. As a result, policy-gradient methods profit most from Reactive Exploration and achieve good results in lifelong learning with continual domain shifts. Our code is available at: https://github.com/ml-jku/reactive-exploration.
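The following is a minimal sketch, not the authors' implementation, of the general idea the abstract describes: a learned forward model whose prediction error rises when rewards or dynamics drift, and which is fed back to the agent as an intrinsic exploration bonus so the policy is updated and explores again after a shift. All names and hyperparameters below (ForwardModel, ReactiveExplorationBonus, beta, lr) are illustrative assumptions; consult the linked repository for the actual method.

```python
# Sketch of prediction-error-driven exploration for non-stationary environments.
# Assumption: the intrinsic bonus is the forward-model prediction error, which
# spikes after a domain shift and decays as the model adapts to the new dynamics.
import torch
import torch.nn as nn


class ForwardModel(nn.Module):
    """Predicts the next observation from the current observation and action."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))


class ReactiveExplorationBonus:
    """Turns forward-model prediction error into an intrinsic reward bonus."""

    def __init__(self, obs_dim: int, act_dim: int, beta: float = 0.1, lr: float = 1e-3):
        self.model = ForwardModel(obs_dim, act_dim)
        self.optim = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.beta = beta  # weight of the intrinsic bonus (assumed value)

    def reward(self, obs, act, next_obs, extrinsic_reward: float) -> float:
        obs = torch.as_tensor(obs, dtype=torch.float32)
        act = torch.as_tensor(act, dtype=torch.float32)
        next_obs = torch.as_tensor(next_obs, dtype=torch.float32)

        # Prediction error of the forward model; large after a domain shift.
        pred = self.model(obs, act)
        error = torch.mean((pred - next_obs) ** 2)

        # Keep the forward model up to date so the bonus tracks the *current*
        # dynamics and decays once the agent has adapted to the shift.
        self.optim.zero_grad()
        error.backward()
        self.optim.step()

        return extrinsic_reward + self.beta * error.item()
```

In a training loop with a policy-gradient learner such as PPO, one would replace the environment reward r with `bonus.reward(obs, act, next_obs, r)` before storing the transition, so that exploration is automatically boosted wherever the environment has changed.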
