Hyperbolic Discounting and Learning over Multiple Horizons

Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor weights future rewards exponentially, a scheme that yields theoretical convergence guarantees for the Bellman equation. However, evidence from psychology, economics, and neuroscience suggests that humans and animals instead exhibit hyperbolic time preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still relying on familiar temporal-difference learning techniques. Additionally, and independently of hyperbolic discounting, we make the surprising discovery that simultaneously learning value functions over multiple time horizons is an effective auxiliary task, one that often improves the performance of a strong value-based RL agent, Rainbow.
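
The "simple approach" has a clean mathematical core: the integral of gamma^(k*t) over gamma in [0, 1] equals 1/(1 + k*t), so a hyperbolic discount is exactly a uniform mixture of exponential discounts, and a weighted sum of ordinary exponentially discounted value estimates approximates a hyperbolically discounted one. The NumPy sketch below illustrates this construction; the function names, the uniform grid of discount factors, and the lower-Riemann-sum weights are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Identity: integral_0^1 gamma**(k*t) d(gamma) = 1 / (1 + k*t),
# i.e. the hyperbolic discount is a uniform mixture of exponential discounts.
# Quick numerical check with a left-endpoint Riemann sum:
k, t = 0.1, 20
grid = np.linspace(0.0, 1.0, 10_000, endpoint=False)
assert abs(np.mean(grid ** (k * t)) - 1.0 / (1.0 + k * t)) < 1e-3

def hyperbolic_weights(gammas: np.ndarray) -> np.ndarray:
    """Lower-Riemann-sum weights: gamma_i covers the cell [gamma_i, gamma_{i+1})."""
    return np.diff(np.append(gammas, 1.0))

def hyperbolic_q(q_heads: np.ndarray, gammas: np.ndarray) -> np.ndarray:
    """Combine per-discount Q estimates into one hyperbolic Q estimate.

    q_heads[i] is assumed to be a row of TD-learned Q-values (one per action)
    under the exponential discount gammas[i] ** k, the factor that appears
    after the change of variables in the identity above. These same heads are
    what the abstract calls value functions over multiple time horizons.
    """
    w = hyperbolic_weights(gammas)            # widths of the integration cells
    return (w[:, None] * q_heads).sum(axis=0)

# Hypothetical usage: 10 heads, 4 actions, act greedily on the combined estimate.
gammas = np.linspace(0.0, 0.99, 10)
q_heads = np.random.rand(10, 4)               # stand-in for learned Q-heads
action = int(np.argmax(hyperbolic_q(q_heads, gammas)))
```

Dropping the combination and simply training the per-gamma heads alongside the main task recovers the multi-horizon auxiliary task that the abstract reports improving Rainbow.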

[1] P. Samuelson. A Note on Measurement of Utility, 1937.

[2] R. H. Strotz. Myopia and Inconsistency in Dynamic Utility Maximization, 1955.

[3] R. Bellman. A Markovian Decision Process, 1957.

[4] Richard Bellman. On a Routing Problem, 1958.

[5] G. Ainslie. Specious reward: a behavioral theory of impulsiveness and impulse control. Psychological Bulletin, 1975.

[6] L. Green et al. Preference reversal and self control: choice as a function of reward amount and delay, 1981.

[7] J. E. Mazur. Probability and delay of reinforcement as factors in discrete-trial choice. Journal of the Experimental Analysis of Behavior, 1985.

[8] J. E. Mazur. An adjusting procedure for studying delayed reinforcement, 1987.

[9] Michael McCloskey et al. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem, 1989.

[10] S. C. Suddarth et al. Rule-Injection Hints as a Means of Improving Network Performance and Learning Time. EURASIP Workshop, 1990.

[11] Geoffrey E. Hinton et al. Feudal Reinforcement Learning. NIPS, 1992.

[12] Satinder P. Singh. Scaling Reinforcement Learning Algorithms by Learning Variable Temporal Resolution Models. ML, 1992.

[13] Ming Tan. Multi-Agent Reinforcement Learning: Independent versus Cooperative Agents. ICML, 1997.

[14] Eugene A. Feinberg et al. Markov Decision Models with Weighted Discounted Criteria. Math. Oper. Res., 1994.

[15] L. Green et al. Temporal discounting and preference reversals in choice between delayed outcomes. Psychonomic Bulletin & Review, 1994.

[16] John N. Tsitsiklis et al. Neuro-dynamic programming: an overview. Proceedings of the 34th IEEE Conference on Decision and Control, 1995.

[17] Richard S. Sutton. TD Models: Modeling the World at a Mixture of Time Scales. ICML, 1995.

[18] L. Green et al. Discounting of delayed rewards: Models of individual choice. Journal of the Experimental Analysis of Behavior, 1995.

[19] P. Dayan et al. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. The Journal of Neuroscience, 1996.

[20] G. Loewenstein. Out of control: Visceral influences on behavior, 1996.

[21] J. E. Mazur. Choice, delay, probability, and conditioned reinforcement, 1997.

[22] Peter Dayan et al. A Neural Substrate of Prediction and Reward. Science, 1997.

[23] A. Kacelnik. Normative and descriptive models of decision making: time discounting and risk sensitivity. Ciba Foundation Symposium, 2007.

[24] L. Green et al. Rate of temporal discounting decreases with amount of reward. Memory & Cognition, 1997.

[25] Leslie Pack Kaelbling et al. Planning and Acting in Partially Observable Stochastic Domains. Artif. Intell., 1998.

[26] Peter D. Sozou. On hyperbolic discounting and uncertain hazard rates. Proceedings of the Royal Society of London, Series B: Biological Sciences, 1998.

[27] Eugene A. Feinberg et al. Constrained dynamic programming with two discount factors: applications and an algorithm. IEEE Trans. Autom. Control, 1999.

[28] David S. Touretzky et al. Behavioral considerations suggest an average reward TD model of the dopamine system. Neurocomputing, 2000.

[29] G. Loewenstein et al. Time Discounting and Time Preference: A Critical Review, 2002.

[30] N. Daw. Reinforcement learning models of the dopamine system and their behavioral implications, 2003.

[31] Saori C. Tanaka et al. Prediction of immediate and future rewards differentially recruits cortico-basal ganglia loops. Nature Neuroscience, 2004.

[32] L. Green et al. A discounting framework for choice with delayed and probabilistic rewards. Psychological Bulletin, 2004.

[33] Michael Kearns et al. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 1998.

[34] Edmund H. Durfee et al. Stationary Deterministic Policies for Constrained MDPs with Multiple Rewards, Costs, and Discount Factors. IJCAI, 2005.

[35] E. Maskin et al. Uncertainty and Hyperbolic Discounting, 2005.

[36] Richard S. Sutton et al. Reinforcement Learning: An Introduction, 1998.

[37] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[38] John R. Anderson et al. From recurrent choice to skill learning: a reinforcement-learning model. Journal of Experimental Psychology: General, 2006.

[39] R. French. Catastrophic Forgetting in Connectionist Networks, 2006.

[40] Colin Camerer et al. A framework for studying the neurobiology of value-based decision making. Nature Reviews Neuroscience, 2008.

[41] Saori C. Tanaka et al. Low-Serotonin Levels Increase Delayed Reward Discounting in Humans. The Journal of Neuroscience, 2008.

[42] T. Maia. Reinforcement learning, conditioning, and the brain: Successes and challenges. Cognitive, Affective & Behavioral Neuroscience, 2009.

[43] Zeb Kurth-Nelson et al. Temporal-Difference Reinforcement Learning with Distributed Representations. PLoS ONE, 2009.

[44] Z. Kurth-Nelson et al. Neural Models of Temporal Discounting, 2009.

[45] William H. Alexander et al. Hyperbolically Discounted Temporal Difference Learning. Neural Computation, 2010.

[46] Geoffrey E. Hinton et al. Rectified Linear Units Improve Restricted Boltzmann Machines. ICML, 2010.

[47] Tor Lattimore et al. Time Consistent Discounting. ALT, 2011.

[48] Patrick M. Pilarski et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. AAMAS, 2011.

[49] Michael L. Littman et al. Expressing Tasks Robustly via Multiple Discount Factors, 2015.

[50] Damien Ernst et al. How to Discount Deep Reinforcement Learning: Towards New Dynamic Strategies. arXiv, 2015.

[51] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[52] Marc G. Bellemare et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract). IJCAI, 2012.

[53] G. Ainslie. Picoeconomics: The Strategic Interaction of Successive Motivational States Within the Person, 1992.

[54] Martha White. Unifying Task Specification in Reinforcement Learning. ICML, 2016.

[55] Kenji Doya et al. Average Reward Optimization with Multiple Discounting Reinforcement Learners. ICONIP, 2017.

[56] Marc G. Bellemare et al. A Distributional Perspective on Reinforcement Learning. ICML, 2017.

[57] Razvan Pascanu et al. Learning to Navigate in Complex Environments. ICLR, 2016.

[58] Tom Schaul et al. Reinforcement Learning with Unsupervised Auxiliary Tasks. ICLR, 2016.

[59] Guillaume Lample et al. Playing FPS Games with Deep Reinforcement Learning. AAAI, 2016.

[60] Tom Schaul et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI, 2017.

[61] David Silver et al. Meta-Gradient Reinforcement Learning. NeurIPS, 2018.

[62] Satinder Singh et al. Many-Goals Reinforcement Learning. arXiv, 2018.

[63] Marlos C. Machado et al. Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents. J. Artif. Intell. Res., 2017.

[64] Marc G. Bellemare et al. Dopamine: A Research Framework for Deep Reinforcement Learning. arXiv, 2018.

[65] P. Pilarski et al. Generalizing Value Estimation over Timescale, 2018.

[66] Silviu Pitis et al. Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach. AAAI, 2019.

[67] Amos J. Storkey et al. Exploration by Random Network Distillation. ICLR, 2018.

[68] Joelle Pineau et al. Separating value functions across time-scales. ICML, 2019.