论文信息 - Convergence of Optimistic and Incremental Q-Learning

Convergence of Optimistic and Incremental Q-Learning

We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees the converges to an e-optimal policy. The second is a new and novel algorithm incremental Q-learning, which gradually promotes the values of actions that are not taken. We show that incremental Q-learning converges, in the limit, to the optimal policy Our incremental Q-learning algorithm can be viewed as derandomization of the e-greedy Q-learning.

Yishay Mansour | Eyal Even-Dar | Y. Mansour | Eyal Even-Dar

[1] Sean P. Meyn,et al. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[2] Yishay Mansour,et al. Learning Rates for Q-learning , 2004, J. Mach. Learn. Res..

[3] Michael I. Jordan,et al. MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES , 1996 .

[4] Thomas G. Dietterich. What is machine learning? , 2020, Archives of Disease in Childhood.

[5] Andrew G. Barto,et al. Reinforcement learning , 1998 .

[6] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[7] John N. Tsitsiklis,et al. Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.

[8] John N. Tsitsiklis,et al. Asynchronous stochastic approximation and Q-learning , 1993, Proceedings of 32nd IEEE Conference on Decision and Control.

[9] Ronen I. Brafman,et al. R-MAX - A General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning , 2001, J. Mach. Learn. Res..

[10] Csaba Szepesvári,et al. A Generalized Reinforcement-Learning Model: Convergence and Applications , 1996, ICML.

[11] Ben J. A. Kröse,et al. Learning from delayed rewards , 1995, Robotics Auton. Syst..

[12] Richard S. Sutton,et al. Introduction to Reinforcement Learning , 1998 .