Paper information - Learning to Cooperate via Policy Search

Cooperative games are those in which both agents share the same payoff structure. Value-based reinforcement-learning algorithms, such as variants of Q-learning, have been applied to learning cooperative games, but they only apply when the game state is completely observable to both agents. Policy search methods are a reasonable alternative to value-based methods for partially observable environments. In this paper, we provide a gradient-based distributed policy-search method for cooperative games and compare the notion of local optimum to that of Nash equilibrium. We demonstrate the effectiveness of this method experimentally in a small, partially observable simulated soccer domain.
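To make the idea of gradient-based distributed policy search concrete, here is a minimal sketch. It is not the paper's algorithm or its soccer domain. It assumes a stateless two-agent, two-action game with a hypothetical shared payoff table (`PAYOFF`), softmax policies, and a REINFORCE-style likelihood-ratio update. All names and parameter values are illustrative. Each agent adjusts only its own parameters, using only its own action and the common reward.

```python
import math
import random

# Hypothetical shared payoff table: both agents receive PAYOFF[a0][a1].
# Joint action (0, 0) is the best outcome; (1, 1) is a weaker coordinated outcome.
PAYOFF = [[1.0, 0.0],
          [0.0, 0.5]]


def softmax(prefs):
    """Turn a list of action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw an action index according to the given probabilities."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1


def train(steps=5000, lr=0.1, seed=0):
    """Run independent likelihood-ratio gradient ascent for two agents."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # one preference vector per agent
    for _ in range(steps):
        probs = [softmax(t) for t in theta]
        actions = [sample(p, rng) for p in probs]
        reward = PAYOFF[actions[0]][actions[1]]  # same reward for both agents
        for i in range(2):
            for a in range(2):
                # Gradient of log softmax probability of the chosen action.
                grad_log = (1.0 if a == actions[i] else 0.0) - probs[i][a]
                theta[i][a] += lr * reward * grad_log
    return [softmax(t) for t in theta]
```

Calling `train()` returns each agent's final action probabilities. In this toy game both (0, 0) and (1, 1) are Nash equilibria, so a run may settle on either one. This gives a small-scale picture of why a local optimum of the gradient procedure need not be the best joint policy.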
