A primer on reinforcement learning in the brain: Psychological, computational, and neural perspectives

Over the last 15 years, research into the neural basis of reinforcement learning has flourished, drawing together insights and findings from psychology, computer science, and neuroscience. This confluence of three fields has yielded a growing framework that begins to explain how animals and humans learn to make decisions in real time. Mastering this literature can be daunting because it requires working knowledge of at least three disciplines, each with its own jargon, perspectives, and shared background knowledge. In this chapter, we attempt to make this line of research more accessible to researchers in any of the constituent disciplines. To this end, we develop a primer for reinforcement learning in the brain that lays out in plain language many of the key ideas and concepts that underpin research in this area. This primer is embedded in a literature review that aims to be representative, rather than comprehensive, of the types of questions and answers that have arisen in the quest to understand reinforcement learning and its neural substrates. Drawing on the basic findings of this research enterprise, we conclude with some speculations about how these developments in computational neuroscience may influence future developments in artificial intelligence.
