Bias-corrected Q-learning to control max-operator bias in Q-learning

We identify a class of stochastic control problems with highly random rewards and a high discount factor that induce large statistical errors in the estimated action-value function. These errors produce significant max-operator bias in Q-learning, which can cause the algorithm to diverge for millions of iterations. We present a bias-corrected Q-learning algorithm whose correction of the max-operator bias is asymptotically unbiased, and we show that, like Q-learning, it converges asymptotically to the optimal policy. Experiments in a domain with highly random rewards show that bias-corrected Q-learning performs well where Q-learning and related algorithms suffer from the max-operator bias.
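The max-operator bias referred to above is the elementary fact that, for noisy estimates, E[max_a Q̂(a)] ≥ max_a E[Q̂(a)]: taking the maximum over statistically uncertain action values systematically overestimates the true maximum. The sketch below illustrates this numerically and applies one standard Gaussian-based correction (subtracting the standard error times the expected maximum of i.i.d. standard normals). This is an illustration of the general idea only, not the paper's algorithm; the parameters M, n, and sigma are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumptions, not from the paper): M actions with
# identical true value 0, each estimated as a sample mean of n noisy
# observations with noise standard deviation sigma.
M, n, sigma = 5, 10, 1.0
n_trials = 100_000
se = sigma / np.sqrt(n)  # standard error of each action-value estimate

# Expected maximum of M i.i.d. standard normals, estimated by Monte Carlo.
e_max_std_normal = rng.standard_normal((n_trials, M)).max(axis=1).mean()

# Noisy action-value estimates across many independent trials.
q_hat = rng.normal(0.0, se, size=(n_trials, M))
naive_max = q_hat.max(axis=1).mean()           # E[max_a Q̂(a)], biased upward
corrected = naive_max - se * e_max_std_normal  # subtract estimated bias

print("true max value:        0.0")
print(f"E[max of estimates]:   {naive_max:.4f}")  # positive: max-operator bias
print(f"bias-corrected value:  {corrected:.4f}")  # close to 0
```

Because every true action value is 0, the naive max of the estimates should also be 0 in expectation, yet it comes out strictly positive; subtracting the Gaussian bias term brings the estimate back near the truth.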
