Learning and Coordinating Repertoires of Behaviors with Common Reward: Credit Assignment and Module Activation

Understanding extended natural behavior will require a theoretical account of the entire system as it engages in perception and action under multiple concurrent goals, such as foraging for different foods while avoiding different predators and looking for a mate. Reinforcement learning (RL) is a promising framework for this, as it addresses in very general terms the problem of choosing actions so as to maximize a measure of cumulative benefit through learning, and many connections between RL and animal learning have been established. Within this framework, we consider the problem faced by a single agent comprising multiple separate elemental task learners, which we call modules, that jointly learn to solve tasks arising as different combinations of concurrent individual tasks across episodes. In some episodes the goal may be to collect different types of food; in others, several predators must be avoided. The individual modules have separate state representations, i.e., they receive different inputs, but they must act jointly in the agent's common action space. Only a single scalar measure of success is observed: the sum of the reward contributions from all component tasks. We provide a computational solution both for learning elemental task policies as they contribute to composite goals and for learning to schedule these modules across episodes for different composite tasks. The algorithm learns to choose the appropriate modules for a particular task and solves the problem of estimating each module's contribution to the total reward. The latter estimate is obtained by combining the module's current reward estimate with an error signal given by the difference between the global reward and the summed reward estimates of the other co-active modules. As the modules interact through their action value estimates, action selection is based on their composite contribution to each task combination. The algorithm learns good action value functions for component tasks and task combinations, as demonstrated on small classical benchmark problems and on a more complex visuomotor navigation task.
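As a rough illustration of the credit-assignment scheme described above, the following is a minimal tabular sketch, not the paper's exact algorithm: each module keeps its own action values and a reward-estimate table, actions are chosen greedily over the summed module values, and each module updates its reward estimate toward the global reward minus the estimates of the other co-active modules before applying a SARSA-style value update. The class `Module`, the table `r_hat`, and the learning rates `alpha` and `beta` are illustrative assumptions, not names from the paper.

```python
import numpy as np

class Module:
    """One elemental task learner with its own state space (assumed tabular)."""
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, beta=0.1):
        self.Q = np.zeros((n_states, n_actions))      # action values for this module
        self.r_hat = np.zeros((n_states, n_actions))  # estimated share of the global reward
        self.alpha, self.gamma, self.beta = alpha, gamma, beta

def select_action(modules, states, epsilon=0.1):
    """Epsilon-greedy choice over the summed action values of all active modules."""
    n_actions = modules[0].Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    total = sum(m.Q[s] for m, s in zip(modules, states))
    return int(np.argmax(total))

def sarsa_step(modules, states, action, global_reward, next_states, next_action):
    """One learning step: assign credit for the shared reward, then update values."""
    for i, (m, s, s2) in enumerate(zip(modules, states, next_states)):
        # Credit assignment: this module's error is the global reward minus the
        # reward estimates of the other co-active modules and its own estimate.
        others = sum(o.r_hat[so, action]
                     for j, (o, so) in enumerate(zip(modules, states)) if j != i)
        credit_error = global_reward - others - m.r_hat[s, action]
        m.r_hat[s, action] += m.beta * credit_error
        # SARSA update driven by the module's own estimated reward share.
        td_error = (m.r_hat[s, action]
                    + m.gamma * m.Q[s2, next_action]
                    - m.Q[s, action])
        m.Q[s, action] += m.alpha * td_error
```

In this sketch the modules never see the individual task rewards; the per-module reward estimates emerge solely from repeatedly reconciling the shared scalar reward with the estimates of whichever modules happen to be co-active in an episode.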
