The Phenomenon of Policy Churn

We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations for why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts ε-greedy exploration in a fresh light, namely that ε-noise plays a much smaller role than expected.
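
The quantity being measured is straightforward to state. As a rough illustration (not the paper's own instrumentation), the sketch below estimates churn as the fraction of a fixed probe set of states whose greedy action differs between two snapshots of the Q-values, e.g. before and after a handful of learning updates; the probe-set size, action count, and the random perturbation standing in for an update are arbitrary choices for this example.

```python
import numpy as np

def policy_churn(q_before: np.ndarray, q_after: np.ndarray) -> float:
    """Fraction of probe states whose greedy action changed.

    q_before, q_after: arrays of shape [num_states, num_actions] holding
    action-value estimates for the same fixed set of probe states,
    evaluated before and after one (or a few) learning updates.
    """
    greedy_before = q_before.argmax(axis=1)
    greedy_after = q_after.argmax(axis=1)
    return float(np.mean(greedy_before != greedy_after))

# Toy usage: random Q-values for 1000 probe states and 18 actions
# (the full Atari action set), slightly perturbed to mimic an update.
rng = np.random.default_rng(0)
q0 = rng.normal(size=(1000, 18))
q1 = q0 + 0.1 * rng.normal(size=q0.shape)
print(f"churn after one simulated update: {policy_churn(q0, q1):.2%}")
```

In an actual agent, q_before and q_after would come from evaluating the Q-network on the same batch of states at two nearby points in training; the empirical finding is that this fraction is surprisingly large even over very few updates.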
