Strategic best-response learning in multiagent systems

We present a novel, uniform formulation of the problem of reinforcement learning against bounded-memory adaptive adversaries in repeated games, together with methods for learning in this framework. First, we give a strategic definition of best response that optimises rewards over multiple steps, in contrast to the tactical (single-step) best response of game theory. We show that the problem of learning a strategic best response reduces to learning an optimal policy in a Markov Decision Process (MDP), and we treat both finite- and infinite-horizon versions of this problem. We adapt an existing Monte Carlo based algorithm to learn optimal policies in such MDPs over a finite horizon in polynomial time. This new efficient algorithm obtains higher average rewards than a previously known efficient algorithm against some opponents in the contract game. Although this improvement comes at the cost of increased domain knowledge, simple experiments in the Prisoner's Dilemma and coordination games show that even when no extra domain knowledge is assumed (beyond an upper bound on the opponent's memory size), the error can still be small. We also test a general infinite-horizon learner (using function approximation to tame the complexity of the history space) against a greedy bounded-memory opponent and show that, while it can create and exploit opportunities for mutual cooperation in the Prisoner's Dilemma, it is cautious enough to secure minimax payoffs in the Rock–Scissors–Paper game.
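The reduction described in the abstract can be sketched concretely: against an opponent whose next move depends only on the last k joint actions, those k joint actions form the state of an induced MDP, which standard dynamic programming can then solve. A minimal illustration under assumed specifics, not the paper's setup: an iterated Prisoner's Dilemma against a hypothetical memory-1 Tit-for-Tat opponent, solved by value iteration.

```python
# Sketch: learning a strategic (multi-step) best response to a bounded-memory
# opponent by solving the induced MDP. The opponent model (Tit-for-Tat) and
# payoff matrix are illustrative assumptions, not the paper's contract game.
import itertools

ACTIONS = ["C", "D"]
# Row player's Prisoner's Dilemma payoffs: (my action, opponent action) -> reward
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def tit_for_tat(state):
    """Memory-1 opponent: repeats our previous action (first state component)."""
    my_prev, _ = state
    return my_prev

def best_response_policy(opponent, gamma=0.9, iters=500):
    """Value iteration on the induced MDP whose state is the last joint action.

    From state s the opponent plays o = opponent(s); choosing action a yields
    reward PAYOFF[(a, o)] and deterministic next state (a, o).
    """
    states = list(itertools.product(ACTIONS, ACTIONS))
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: max(PAYOFF[(a, opponent(s))] + gamma * V[(a, opponent(s))]
                    for a in ACTIONS)
             for s in states}
    # Greedy policy with respect to the converged values
    return {s: max(ACTIONS,
                   key=lambda a: PAYOFF[(a, opponent(s))]
                                 + gamma * V[(a, opponent(s))])
            for s in states}

policy = best_response_policy(tit_for_tat)
```

Against Tit-for-Tat with discount 0.9, sustained cooperation (3 per step) outweighs a one-shot defection payoff of 5 followed by punishment, so the multi-step best response cooperates from every state; a tactical (single-step) best response would defect. This is the gap between strategic and tactical best response that the abstract refers to.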
