On-policy concurrent reinforcement learning

When an agent learns in a multi-agent environment, the payoff it receives depends on the behaviour of the other agents. If the other agents are also learning, its reward distribution becomes non-stationary, which makes learning in multi-agent systems more difficult than single-agent learning. Prior attempts at value-function based learning in such domains have been built on off-policy Q-learning variants, which do not scale well and have met with limited success. This paper studies on-policy modifications of such algorithms, which promise better scalability and efficiency. In particular, it is proven that these hybrid techniques are guaranteed to converge to their desired fixed points under some restrictions. It is also shown experimentally that the new techniques can learn (from self-play) better policies than the previous algorithms (also in self-play) during some phases of exploration.
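
The distinction the abstract turns on is between off-policy updates, which bootstrap on the greedy value regardless of the action actually taken, and on-policy updates, which bootstrap on the value of the action the exploration policy actually selects. The Python sketch below illustrates that contrast in self-play on a hypothetical repeated 2x2 coordination game treated as a single-state problem. It shows only the standard single-agent Q-learning and Sarsa rules, not the paper's hybrid multi-agent algorithms, and all names and parameters (PAYOFF, ALPHA, GAMMA, EPSILON, STEPS) are assumptions made for the example.

import random

# Minimal self-play sketch (assumed example, not the paper's algorithms).
# PAYOFF[a1][a2] is player 1's reward; the game is symmetric, so player 2
# receives PAYOFF[a2][a1]. The repeated game is treated as a single-state
# MDP with discount factor GAMMA.
PAYOFF = [[1.0, 0.0],
          [0.0, 1.0]]

ACTIONS = [0, 1]
ALPHA, GAMMA, EPSILON, STEPS = 0.1, 0.9, 0.2, 5000


def epsilon_greedy(q, eps):
    """Greedy action with probability 1 - eps, otherwise explore uniformly."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[a])


def q_learning_self_play():
    """Off-policy self-play: each agent bootstraps on max_a Q(a)."""
    q1, q2 = [0.0, 0.0], [0.0, 0.0]
    for _ in range(STEPS):
        a1, a2 = epsilon_greedy(q1, EPSILON), epsilon_greedy(q2, EPSILON)
        r1, r2 = PAYOFF[a1][a2], PAYOFF[a2][a1]
        # Target uses the greedy value, ignoring the exploration actually done.
        q1[a1] += ALPHA * (r1 + GAMMA * max(q1) - q1[a1])
        q2[a2] += ALPHA * (r2 + GAMMA * max(q2) - q2[a2])
    return q1, q2


def sarsa_self_play():
    """On-policy self-play: each agent bootstraps on the action it actually
    takes next under its own exploration policy (Sarsa-style)."""
    q1, q2 = [0.0, 0.0], [0.0, 0.0]
    a1, a2 = epsilon_greedy(q1, EPSILON), epsilon_greedy(q2, EPSILON)
    for _ in range(STEPS):
        r1, r2 = PAYOFF[a1][a2], PAYOFF[a2][a1]
        n1, n2 = epsilon_greedy(q1, EPSILON), epsilon_greedy(q2, EPSILON)
        # Target uses the value of the next action actually selected.
        q1[a1] += ALPHA * (r1 + GAMMA * q1[n1] - q1[a1])
        q2[a2] += ALPHA * (r2 + GAMMA * q2[n2] - q2[a2])
        a1, a2 = n1, n2
    return q1, q2


if __name__ == "__main__":
    print("Q-learning self-play:", q_learning_self_play())
    print("Sarsa self-play:     ", sarsa_self_play())

In the Sarsa-style variant each agent's target moves with its own exploration policy, so the learned values reflect the behaviour actually being executed during exploration; the off-policy target ignores that behaviour, which is one source of the difficulties the paper attributes to off-policy multi-agent learners.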
