Stimulus Representation and the Timing of Reward-Prediction Errors in Models of the Dopamine System

The phasic firing of dopamine neurons has been theorized to encode a reward-prediction error as formalized by the temporal-difference (TD) algorithm in reinforcement learning. Most TD models of dopamine have assumed a stimulus representation, known as the complete serial compound, in which each moment in a trial is distinctly represented. We introduce a more realistic temporal stimulus representation for the TD model. In our model, all external stimuli, including rewards, spawn a series of internal microstimuli, which grow weaker and more diffuse over time. These microstimuli are used by the TD learning algorithm to generate predictions of future reward. This new stimulus representation injects temporal generalization into the TD model and enhances the correspondence between model and data in several experiments, including those in which rewards are omitted or received early. This improved fit derives mostly from the absence of large negative errors in the new model, suggesting that dopamine alone can encode the full range of TD errors in these situations.
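The microstimulus idea can be illustrated with a small sketch. In one common formalization, a stimulus launches an exponentially decaying memory trace, and a bank of Gaussian basis functions over the height of that trace yields microstimuli that peak at successively later delays and become broader in real time; a linear TD(λ) learner then uses these features to predict reward. All parameter values below (number of microstimuli, decay rate, Gaussian width, learning rate) are illustrative assumptions, not the paper's fitted values.

```python
import math

def microstimuli(t, n=10, decay=0.985, sigma=0.08):
    """Microstimulus feature vector for elapsed time t since stimulus onset.

    A decaying memory trace y = decay**t is passed through Gaussian basis
    functions spaced along the trace height, so each microstimulus peaks at
    a different delay and later ones cover wider spans of real time
    (coarser coding of remote times). Parameters are illustrative guesses.
    """
    y = decay ** t
    centers = [(i + 1) / n for i in range(n)]
    return [y * math.exp(-((y - mu) ** 2) / (2 * sigma ** 2)) for mu in centers]

def td_prediction_errors(reward_time, n_steps, gamma=0.98,
                         alpha=0.1, lam=0.9, episodes=200):
    """Linear TD(lambda) over microstimulus features.

    Trains on repeated trials with a unit reward at reward_time and
    returns the prediction errors (deltas) from the final trial.
    """
    n_features = 10
    w = [0.0] * n_features
    for _ in range(episodes):
        z = [0.0] * n_features          # eligibility traces
        deltas = []
        x = microstimuli(0)
        for t in range(n_steps):
            x_next = microstimuli(t + 1)
            r = 1.0 if t == reward_time else 0.0
            v = sum(wi * xi for wi, xi in zip(w, x))
            v_next = sum(wi * xi for wi, xi in zip(w, x_next))
            delta = r + gamma * v_next - v      # TD error
            z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
            w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
            deltas.append(delta)
            x = x_next
    return deltas
```

Because neighboring microstimuli overlap, the learned prediction generalizes across nearby times, which blunts the sharp negative error that a complete-serial-compound representation produces at the exact moment of an omitted reward.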
