Efficient learning of multi-step best response

We provide a unified framework for learning against a recent-history adversary in arbitrary repeated bimatrix games by modeling the interaction with such an opponent as a Markov decision process (MDP). We focus on learning an optimal non-stationary policy in this MDP over a finite horizon and adapt an existing efficient Monte Carlo-based algorithm for learning optimal policies in such MDPs. We show that the resulting algorithm can obtain higher average rewards than a previously known efficient algorithm against some opponents in the contract game. Although this improvement comes at the cost of increased domain knowledge, a simple experiment in the Prisoner's Dilemma shows that even when no extra domain knowledge is assumed (beyond knowing the opponent's memory size), the resulting error can still be small.

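To illustrate the MDP construction described above, the following is a minimal sketch, not the paper's Monte Carlo algorithm, assuming a known memory-1 (Tit-for-Tat-like) opponent in the iterated Prisoner's Dilemma. The state space, the payoff values, and the hypothetical `opponent_prob_C` model are illustrative assumptions; the paper's approach instead estimates values from sampled trajectories rather than from a known opponent model.

```python
# Sketch: a memory-1 opponent induces an MDP whose states are the joint actions
# it remembers; a finite-horizon optimal non-stationary policy is then obtained
# by backward induction. Assumes the opponent's conditional strategy is known.

import itertools

ACTIONS = ["C", "D"]  # cooperate, defect
# Row player's Prisoner's Dilemma payoffs, indexed by (my action, opponent action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def opponent_prob_C(state):
    """Hypothetical memory-1 opponent: probability it cooperates next, given the
    previous joint action (my last action, its last action). Tit-for-Tat-like."""
    my_last, _ = state
    return 1.0 if my_last == "C" else 0.0

# MDP states: the joint actions within the opponent's memory window.
STATES = list(itertools.product(ACTIONS, ACTIONS))

def best_response(horizon):
    """Backward induction over the finite horizon; returns a non-stationary
    policy: policy[t][state] -> my action at step t."""
    V = {s: 0.0 for s in STATES}            # value-to-go after the final step
    policy = [dict() for _ in range(horizon)]
    for t in reversed(range(horizon)):
        V_new = {}
        for s in STATES:
            p_C = opponent_prob_C(s)
            best_a, best_q = None, float("-inf")
            for a in ACTIONS:
                # Expected immediate payoff plus value of the successor state.
                q = sum(p * (PAYOFF[(a, o)] + V[(a, o)])
                        for o, p in (("C", p_C), ("D", 1.0 - p_C)))
                if q > best_q:
                    best_a, best_q = a, q
            policy[t][s] = best_a
            V_new[s] = best_q
        V = V_new
    return policy

if __name__ == "__main__":
    pi = best_response(horizon=5)
    print(pi[0])  # cooperates early against this opponent, defects on the last step
```

Against this particular opponent model, backward induction recovers the familiar behavior of cooperating until the final step and defecting at the end; with an unknown opponent, the same policy structure would have to be learned from experience, which is where the Monte Carlo-based algorithm discussed in the abstract applies.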