Where Do Rewards Come From?

Satinder Singh (baveja@umich.edu), Computer Science & Engineering, University of Michigan, Ann Arbor
Richard L. Lewis (rickl@umich.edu), Department of Psychology, University of Michigan, Ann Arbor
Andrew G. Barto (barto@cs.umass.edu), Department of Computer Science, University of Massachusetts, Amherst

Abstract

Reinforcement learning has achieved broad and successful application in cognitive science in part because of its general formulation of the adaptive control problem as the maximization of a scalar reward function. The computational reinforcement learning framework is motivated by correspondences to animal reward processes, but it leaves the source and nature of the rewards unspecified. This paper advances a general computational framework for reward that places it in an evolutionary context, formulating a notion of an optimal reward function given a fitness function and some distribution of environments. Novel results from computational experiments show how traditional notions of extrinsically and intrinsically motivated behaviors may emerge from such optimal reward functions. In the experiments these rewards are discovered through automated search rather than crafted by hand. The precise form of the optimal reward functions need not bear a direct relationship to the fitness function, but may nonetheless confer significant advantages over rewards based only on fitness.

Introduction

In the computational reinforcement learning (RL) framework (Sutton & Barto, 1998), rewards—more specifically, reward functions—determine the problem the learning agent is trying to solve. Properties of the reward function influence how easy or hard the problem is and how well an agent may do, but RL theory and algorithms are completely insensitive to the source of rewards (except for requiring that their magnitudes be bounded). This is a strength of the framework because of the generality it confers: it can encompass both homeostatic theories of motivation, in which rewards are defined as drive reduction, as has been done in many motivational systems for artificial agents (Savage, 2000), and non-homeostatic theories that can account, for example, for the behavioral effects of electrical brain stimulation and addictive drugs. But it is also a weakness because it defers key questions about the nature of reward functions.

Motivating the RL framework are the following correspondences to animal reward processes. Rewards in an RL system correspond to primary rewards, i.e., rewards that in animals have been hard-wired by the evolutionary process due to their relevance to reproductive success. In RL, they are thought of as the output of a “critic” that evaluates the RL agent’s behavior. Further, RL systems that form value functions, using, for example, Temporal Difference (TD) algorithms, effectively create conditioned or secondary reward processes whereby predictors of primary rewards act as rewards themselves. The learned value function provides ongoing evaluations that are consistent with the more intermittent evaluations of the hard-wired critic. The result is that the local landscape of a value function gives direction to the system’s preferred behavior: decisions are made to cause transitions to higher-valued states. A close parallel can be drawn between the gradient of a value function and incentive motivation (McClure, Daw, & Montague, 2003).
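To make the secondary-reward role of a learned value function concrete, the following is a minimal sketch, not taken from the paper: tabular TD(0) on a hypothetical chain world in which the hard-wired critic delivers primary reward only at one end. The environment, parameter values, and action-selection rule are all illustrative assumptions; the sketch only shows how states that predict primary reward acquire value themselves, and how the local value landscape then directs behavior toward higher-valued states.

```python
# A minimal sketch, not the authors' implementation: tabular TD(0) on a small
# hypothetical chain world. The "hard-wired critic" gives primary reward only
# at the rightmost state; the learned value function V then supplies ongoing
# evaluations everywhere else, and behavior follows its local gradient.
# All parameters and the environment itself are illustrative assumptions.
import random

N_STATES = 6            # states 0..5; primary reward only at state 5
ALPHA, GAMMA = 0.1, 0.95
EPSILON = 0.1           # small amount of random exploration
V = [0.0] * N_STATES    # learned value function (secondary reward process)

def primary_reward(state):
    """Intermittent hard-wired evaluation: nonzero only at the goal."""
    return 1.0 if state == N_STATES - 1 else 0.0

def step(state, action):
    """Move left (-1) or right (+1) along the chain, clipped at the ends."""
    return min(max(state + action, 0), N_STATES - 1)

def choose_action(state):
    """Prefer the neighbor with the higher learned value; break ties randomly."""
    if random.random() < EPSILON:
        return random.choice((-1, 1))
    values = {a: V[step(state, a)] for a in (-1, 1)}
    best = max(values.values())
    return random.choice([a for a, v in values.items() if v == best])

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        a = choose_action(s)
        s_next = step(s, a)
        # TD(0) update: predictors of primary reward acquire value themselves,
        # i.e., a conditioned / secondary reward process.
        V[s] += ALPHA * (primary_reward(s_next) + GAMMA * V[s_next] - V[s])
        s = s_next

print([round(v, 2) for v in V])   # learned values rise toward the rewarded end
```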
Figure 1: Agent-environment interactions in reinforcement learning; adapted from Barto et al. (2004). See text for discussion.

In the usual view of an RL agent interacting with an external environment (left panel of Figure 1), the primary reward comes from the external environment, being generated by a “critic” residing there. But as Sutton and Barto (1998) and Barto, Singh, and Chentanez (2004) point out, this is a seriously misleading view of RL if one wishes to relate this framework to animal reward systems.

In a less misleading view of this interaction (right panel of Figure 1), the RL agent’s environment is divided into external and internal environments. For an animal, the internal environment consists of the systems that are internal to the animal while still being parts of the learning system’s environment. This view makes it clear that reward signals are always generated within the animal, for example, by its dopamine system. Therefore, all rewards are internal, and the internal/external distinction is not a useful one, a point also emphasized by Oudeyer and Kaplan (2007). This is the viewpoint we adopt in this paper.

But what of a distinction between intrinsic and extrinsic reward? Psychologists distinguish between extrinsic motivation, which means doing something because of some specific rewarding outcome, and intrinsic motivation, which refers to “doing something because it is inherently interesting or enjoyable” (Ryan & Deci, 2000). According to this view, intrinsic motivation leads organisms to engage in exploration, play, and other behavior driven by curiosity in the absence of explicit reinforcement or reward.

Barto et al. (2004) used the term intrinsic reward to refer to rewards that produce analogs of intrinsic motivation in RL agents, and extrinsic reward to refer to rewards that define a specific task, as in standard RL applications. We use this terminology here, but the distinction between intrinsic
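The evolutionary framing described in the abstract, in which a reward function is judged by the fitness it ultimately produces, can be illustrated with a deliberately tiny sketch. The example below is ours, not the paper’s experimental setup: the environment distribution (two-armed bandits whose good arm varies), the reward-function family (fitness increment plus a weighted novelty bonus), the Q-learning agent, and every constant are hypothetical assumptions. It shows only the structure of the search: each candidate reward function is scored by the average fitness an agent accumulates when it learns from that reward across sampled environments, and the winning reward need not coincide with the fitness signal itself.

```python
# Illustrative sketch only (not the paper's experiments): exhaustive search
# over a tiny, hypothetical family of reward functions, each scored by the
# expected FITNESS of an RL agent that learns from that reward across a
# distribution of environments. Environments, reward family, and constants
# are all assumptions made for this example.
import random

def sample_environment():
    """Two-armed bandit; which arm yields 'food' (the fitness event) varies."""
    good_arm = random.randrange(2)
    return lambda arm: 1.0 if arm == good_arm else 0.0

def make_reward_fn(bonus_weight):
    """Candidate reward: fitness increment plus a weighted novelty bonus."""
    def reward(food, pull_counts, arm):
        return food + bonus_weight / pull_counts[arm]
    return reward

def lifetime_fitness(reward_fn, env, steps=200):
    """Simple Q-learning agent; note it is scored by fitness, not by reward."""
    q = [0.0, 0.0]
    pull_counts = [0, 0]
    fitness = 0.0
    for _ in range(steps):
        if random.random() < 0.05:
            arm = random.randrange(2)
        else:
            arm = max((0, 1), key=lambda a: q[a])
        food = env(arm)
        fitness += food
        pull_counts[arm] += 1
        r = reward_fn(food, pull_counts, arm)   # internal reward drives learning
        q[arm] += 0.1 * (r - q[arm])
    return fitness

def expected_fitness(bonus_weight, n_environments=300):
    reward_fn = make_reward_fn(bonus_weight)
    total = sum(lifetime_fitness(reward_fn, sample_environment())
                for _ in range(n_environments))
    return total / n_environments

# The "automated search" here is just enumeration of a handful of candidates.
candidates = [0.0, 0.1, 0.5, 1.0, 2.0]
best = max(candidates, key=expected_fitness)
print("highest expected fitness with bonus weight:", best)
```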

References

[1] Edward L. Deci et al. Intrinsic Motivation and Self-Determination in Human Behavior. 1975. Perspectives in Social Psychology.

[2] Jürgen Schmidhuber et al. A possibility for implementing curiosity and boredom in model-building neural controllers. 1991.

[3] David H. Ackley et al. Interactions between learning and evolution. 1991.

[4] Ben J. A. Kröse et al. Learning from delayed rewards. 1995. Robotics and Autonomous Systems.

[5] Stanley J. Rosenschein et al. From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior. 1996.

[6] Andrew Y. Ng et al. Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping. 1999. ICML.

[7] Tony Savage. Artificial motives: A review of motivation in artificial creatures. 2000. Connection Science.

[8] E. Deci et al. Intrinsic and Extrinsic Motivations: Classic Definitions and New Directions. 2000. Contemporary Educational Psychology.

[9] Samuel M. McClure et al. A computational substrate for incentive salience. 2003. Trends in Neurosciences.

[10] Nuttapong Chentanez et al. Intrinsically Motivated Reinforcement Learning. 2004. NIPS.

[11] Nuttapong Chentanez et al. Intrinsically Motivated Learning of Hierarchical Collections of Skills. 2004.

[12] Richard S. Sutton et al. Reinforcement Learning: An Introduction. 1998. IEEE Transactions on Neural Networks.

[13] Pierre-Yves Oudeyer et al. What is Intrinsic Motivation? A Typology of Computational Approaches. 2007. Frontiers in Neurorobotics.

[14] Kenji Doya et al. Finding intrinsic rewards by embodied evolution and constrained reinforcement learning. 2008. Neural Networks.