R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning

R-MAX is a very simple model-based reinforcement learning algorithm that can attain near-optimal average reward in polynomial time. In R-MAX, the agent always maintains a complete, but possibly inaccurate, model of its environment and acts based on the optimal policy derived from this model. The model is initialized in an optimistic fashion: all actions in all states return the maximal possible reward (hence the name). During execution, it is updated based on the agent's observations. R-MAX improves upon several previous algorithms: (1) It is simpler and more general than Kearns and Singh's E3 algorithm, covering zero-sum stochastic games. (2) It has a built-in mechanism for resolving the exploration vs. exploitation dilemma. (3) It formally justifies the "optimism under uncertainty" bias used in many RL algorithms. (4) It is simpler, more general, and more efficient than Brafman and Tennenholtz's LSG algorithm for learning in single-controller stochastic games. (5) It generalizes the algorithm by Monderer and Tennenholtz for learning in repeated games. (6) To date, it is the only provably efficient algorithm for learning in repeated games, considerably improving and simplifying previous algorithms by Banos and by Megiddo.
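The abstract describes the algorithm only informally, so the following is a minimal, illustrative sketch of the optimistic-initialization idea. It is not the paper's construction: the paper treats zero-sum stochastic games under the average-reward criterion, while this sketch uses the common single-agent, discounted-MDP simplification, and every identifier (RMaxAgent, the known-ness threshold m, gamma, and so on) is an assumption of ours rather than notation from the paper.

```python
# Illustrative R-MAX-style agent for a finite MDP (sketch, not the paper's
# exact construction). Unknown state-action pairs are valued as if they led
# to an absorbing state that always pays r_max; a pair becomes "known" after
# m visits, at which point its empirical model is frozen and the agent replans.

import numpy as np


class RMaxAgent:
    def __init__(self, n_states, n_actions, r_max, m=5, gamma=0.95):
        self.nS, self.nA = n_states, n_actions
        self.r_max, self.m, self.gamma = r_max, m, gamma
        # Visit counts and accumulators for the empirical model.
        self.count = np.zeros((n_states, n_actions), dtype=int)
        self.trans_count = np.zeros((n_states, n_actions, n_states), dtype=int)
        self.reward_sum = np.zeros((n_states, n_actions))
        # Optimistic initialization: every pair starts at the maximal value.
        self.Q = np.full((n_states, n_actions), r_max / (1.0 - gamma))

    def act(self, s):
        # Act greedily with respect to the current (optimistic) model.
        return int(np.argmax(self.Q[s]))

    def observe(self, s, a, r, s2):
        if self.count[s, a] >= self.m:
            return  # (s, a) is already known; its model estimate is frozen.
        self.count[s, a] += 1
        self.reward_sum[s, a] += r
        self.trans_count[s, a, s2] += 1
        if self.count[s, a] == self.m:
            self._solve()  # replan whenever a pair becomes known

    def _solve(self, n_iters=500, tol=1e-6):
        # Value iteration on the mixed model: empirical estimates for known
        # pairs, the optimistic absorbing-state value for unknown pairs.
        for _ in range(n_iters):
            V = self.Q.max(axis=1)
            Q_new = np.empty_like(self.Q)
            for s in range(self.nS):
                for a in range(self.nA):
                    if self.count[s, a] >= self.m:
                        p = self.trans_count[s, a] / self.count[s, a]
                        r = self.reward_sum[s, a] / self.count[s, a]
                        Q_new[s, a] = r + self.gamma * (p @ V)
                    else:
                        # Unknown: pretend it yields r_max forever.
                        Q_new[s, a] = self.r_max / (1.0 - self.gamma)
            if np.max(np.abs(Q_new - self.Q)) < tol:
                self.Q = Q_new
                break
            self.Q = Q_new
```

Because every unvisited state-action pair keeps the optimistic value r_max / (1 - gamma), the greedy policy is automatically drawn toward insufficiently sampled pairs until they become known; this is one way to read the built-in exploration vs. exploitation mechanism claimed in item (2) of the abstract.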

[1] R. Karp, et al. On Nonterminating Stochastic Games, 1966.

[2] A. Banos. On Pseudo-Games, 1968.

[3] N. Megiddo. On repeated games with incomplete information played by non-Bayesian players, 1980.

[4] Richard S. Sutton, et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[5] Jürgen Schmidhuber, et al. Curious model-building control systems, 1991, Proceedings of the 1991 IEEE International Joint Conference on Neural Networks.

[6] Andrew W. Moore, C. Atkeson. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time, 1993, Machine Learning.

[7] Leslie Pack Kaelbling, et al. Learning in embedded systems, 1993.

[8] D. Fudenberg, et al. Self-confirming equilibrium, 1993.

[9] Michael L. Littman, et al. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.

[10] Andrew W. Moore, et al. Reinforcement Learning: A Survey, 1996, J. Artif. Intell. Res.

[11] Csaba Szepesvári, et al. A Generalized Reinforcement-Learning Model: Convergence and Applications, 1996, ICML.

[12] Moshe Tennenholtz, et al. Dynamic Non-Bayesian Decision Making, 1997, J. Artif. Intell. Res.

[13] Prasad Tadepalli, et al. Model-Based Average Reward Reinforcement Learning, 1998, Artif. Intell.

[14] Michael P. Wellman, et al. Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm, 1998, ICML.

[15] Richard S. Sutton, et al. Introduction to Reinforcement Learning, 1998.

[16] Michael Kearns, et al. Efficient Reinforcement Learning in Factored MDPs, 1999, IJCAI.

[17] Ronen I. Brafman, et al. A near-optimal polynomial time algorithm for learning in certain classes of stochastic games, 2000, Artif. Intell.

[18] S. Hart, et al. A Reinforcement Procedure Leading to Correlated Equilibrium, 2001.

[19] David B. Leake. Artificial Intelligence, 2001.

[20] Michael Kearns, et al. Near-Optimal Reinforcement Learning in Polynomial Time, 1998, Machine Learning.
