Retrospective model-based inference guides model-free credit assignment

An extensive reinforcement learning literature shows that organisms assign credit efficiently, even under conditions of state uncertainty. However, little is known about credit assignment when state uncertainty is subsequently resolved. Here, we address this problem within the framework of an interaction between model-free (MF) and model-based (MB) control systems. We present, and support experimentally, a theory of MB retrospective inference. Within this framework, an MB system resolves uncertainty that prevailed when actions were taken, thereby guiding MF credit assignment. Using a task in which there was initial uncertainty about the lotteries that were chosen, we found that when participants' momentary uncertainty about which lottery had generated an outcome was resolved by subsequent information, they preferentially assigned credit within the MF system to the lottery they retrospectively inferred was responsible for that outcome. These findings extend our knowledge about the range of MB functions and the scope of interactions between the two systems.

The reinforcement learning literature suggests that decisions are based on a model-free system, operating retrospectively, and a model-based system, operating prospectively. Here, the authors show that a model-based retrospective inference of a reward's cause guides model-free credit assignment.

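To make the proposed mechanism concrete, here is a minimal sketch (our own illustration, not the authors' implementation) of how a retrospective MB posterior over hidden states could gate an MF value update: credit for a reward is divided between the two candidate lotteries in proportion to the inferred probability that each one generated the outcome. All names and values (q_values, alpha, the 0.9/0.1 posterior) are hypothetical.

```python
import numpy as np

alpha = 0.1             # MF learning rate (assumed value)
q_values = np.zeros(2)  # MF values for the two candidate lotteries

def mf_update_with_retrospective_inference(reward, posterior):
    """Assign MF credit for `reward` in proportion to the MB posterior
    belief about which lottery generated it.

    posterior: array of P(lottery i caused the outcome), as computed by
    the MB system once disambiguating information arrives (e.g. via
    Bayes' rule over the task's generative model).
    """
    for i in range(len(q_values)):
        # Each lottery's prediction error is weighted by the retrospective
        # belief that this lottery was responsible for the outcome.
        delta = reward - q_values[i]
        q_values[i] += alpha * posterior[i] * delta

# Example: the outcome was ambiguous at choice time, but later information
# makes lottery 0 the likely cause (posterior 0.9 vs 0.1).
mf_update_with_retrospective_inference(reward=1.0, posterior=np.array([0.9, 0.1]))
print(q_values)  # lottery 0 receives most of the credit
```

On this reading, a pure MF learner would split credit by a fixed or uniform rule; the behavioral signature reported in the abstract corresponds to the posterior-weighted update above.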