Coordinated Reinforcement Learning

We present several new algorithms for multiagent reinforcement learning. A common feature of these algorithms is a parameterized, structured representation of a policy or value function. This structure is leveraged in an approach we call coordinated reinforcement learning, in which agents coordinate both their action selection and their parameter updates. Within the limits of our parametric representations, the agents can determine a jointly optimal action without explicitly considering every possible action in their exponentially large joint action space. Our methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appear at the core of both the learning algorithm and the execution architecture.
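
The coordination step can be illustrated with variable elimination over a coordination graph: when the global Q-function decomposes into a sum of local terms, each involving only a small subset of agents, the joint maximization can proceed one agent at a time instead of enumerating all joint actions. The following is a minimal sketch of that idea; the factored Q-function, binary action sets, agent names, and elimination order are illustrative assumptions, not the paper's implementation.

```python
# Sketch: joint action selection by variable elimination over a factored
# Q-function (a sum of local terms, each over a small subset of agents).
# All factors, names, and the binary action sets below are assumptions
# made for illustration.
from itertools import product

ACTIONS = [0, 1]  # assume every agent chooses a binary action


def eliminate(factors, agent):
    """Max out one agent: merge every factor that mentions it into a new
    factor over the remaining agents in the combined scope, recording the
    agent's best response to each setting of those neighbours."""
    touching = [f for f in factors if agent in f[0]]
    rest = [f for f in factors if agent not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {agent}))
    new_table, best_response = {}, {}
    for assignment in product(ACTIONS, repeat=len(new_scope)):
        ctx = dict(zip(new_scope, assignment))
        best_val, best_act = None, None
        for a in ACTIONS:
            full = {**ctx, agent: a}
            val = sum(table[tuple(full[v] for v in scope)]
                      for scope, table in touching)
            if best_val is None or val > best_val:
                best_val, best_act = val, a
        new_table[assignment] = best_val
        best_response[assignment] = best_act
    return rest + [(new_scope, new_table)], (new_scope, best_response)


def argmax_joint_action(factors, order):
    """Jointly optimal action for a sum of local Q-factors.
    Each factor is (scope, table): scope is a tuple of agent names,
    table maps a tuple of their actions to a local Q-value."""
    strategies = []
    for agent in order:
        factors, strategy = eliminate(factors, agent)
        strategies.append((agent, strategy))
    # All remaining factors are constants; recover actions in reverse
    # order, so each agent's neighbours are decided before it is revisited.
    joint = {}
    for agent, (scope, best_response) in reversed(strategies):
        joint[agent] = best_response[tuple(joint[v] for v in scope)]
    return joint


# Example: three agents on a chain A-B-C, with Q = Q_AB + Q_BC.
factors = [
    (("A", "B"), {(a, b): float(a == b) for a, b in product(ACTIONS, repeat=2)}),
    (("B", "C"), {(b, c): float(b != c) for b, c in product(ACTIONS, repeat=2)}),
]
print(argmax_joint_action(factors, order=["A", "B", "C"]))
# -> e.g. {'C': 1, 'B': 0, 'A': 0}, a maximizer of Q_AB + Q_BC
```

The cost of this procedure is exponential only in the size of the largest intermediate factor (the induced width of the coordination graph under the chosen elimination order), not in the number of agents, which is what allows a jointly optimal action to be found without touching the full joint action space.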
