Bias-corrected Q-learning to control max-operator bias in Q-learning

We identify a class of stochastic control problems with highly random rewards and a high discount factor that induce large statistical errors in the estimated action-value function. These errors produce significant max-operator bias in Q-learning, which can cause the algorithm to diverge for millions of iterations. We present a bias-corrected Q-learning algorithm whose correction of the max-operator bias is asymptotically unbiased, and we show that, like Q-learning, it converges asymptotically to the optimal policy. Experiments in a domain with highly random rewards show that bias-corrected Q-learning performs well where Q-learning and related algorithms suffer from the max-operator bias.
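The max-operator bias referred to above is the elementary fact that, for noisy estimates, E[max_a Q̂(a)] ≥ max_a E[Q̂(a)]: taking the maximum over statistically uncertain action values systematically overestimates the true maximum. The sketch below illustrates this numerically and applies one standard Gaussian-based correction (subtracting the standard error times the expected maximum of i.i.d. standard normals). This is an illustration of the general idea only, not the paper's algorithm; the parameters M, n, and sigma are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (assumptions, not from the paper): M actions with
# identical true value 0, each estimated as a sample mean of n noisy
# observations with noise standard deviation sigma.
M, n, sigma = 5, 10, 1.0
n_trials = 100_000
se = sigma / np.sqrt(n)  # standard error of each action-value estimate

# Expected maximum of M i.i.d. standard normals, estimated by Monte Carlo.
e_max_std_normal = rng.standard_normal((n_trials, M)).max(axis=1).mean()

# Noisy action-value estimates across many independent trials.
q_hat = rng.normal(0.0, se, size=(n_trials, M))
naive_max = q_hat.max(axis=1).mean()           # E[max_a Q̂(a)], biased upward
corrected = naive_max - se * e_max_std_normal  # subtract estimated bias

print("true max value:        0.0")
print(f"E[max of estimates]:   {naive_max:.4f}")  # positive: max-operator bias
print(f"bias-corrected value:  {corrected:.4f}")  # close to 0
```

Because every true action value is 0, the naive max of the estimates should also be 0 in expectation, yet it comes out strictly positive; subtracting the Gaussian bias term brings the estimate back near the truth.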
