Paper information - Incremental multi-step Q-learning

Incremental multi-step Q-learning

This paper presents a novel incremental algorithm that combines Q-learning, a well-known dynamic-programming-based reinforcement learning method, with the TD(λ) return estimation process, which is typically used in actor-critic learning, another well-known dynamic-programming-based reinforcement learning method. The parameter λ is used to distribute credit throughout sequences of actions, leading to faster learning and also helping to alleviate the non-Markovian effect of coarse state-space quantization. The resulting algorithm, Q(λ)-learning, thus combines some of the best features of the Q-learning and actor-critic learning paradigms. The behavior of this algorithm has been demonstrated through computer simulations.
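The abstract does not state the update rules, so the following is only a minimal tabular sketch of how an incremental, trace-based Q(λ)-style update might look, to illustrate how λ spreads credit back along a sequence of actions. The function names, the dictionary-based tables, the step sizes, and the two-step chain task are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def q_lambda_step(Q, trace, state, action, reward, next_state, actions,
                  alpha=0.1, gamma=0.9, lam=0.8):
    """Apply one incremental update in place (illustrative sketch only).

    Q and trace map (state, action) pairs to floats. A terminal state is
    assumed to keep all of its Q-values at zero.
    """
    v_next = max(Q[(next_state, a)] for a in actions)
    v_curr = max(Q[(state, a)] for a in actions)

    # One-step error for the pair just taken.
    err_taken = reward + gamma * v_next - Q[(state, action)]
    # Error measured against the greedy value of the current state;
    # this is the quantity passed back to earlier pairs via the traces.
    err_greedy = reward + gamma * v_next - v_curr

    # Decay every trace by gamma * lam, then credit earlier pairs.
    for key in list(trace):
        trace[key] *= gamma * lam
        Q[key] += alpha * trace[key] * err_greedy

    # Update the current pair and mark it as recently visited.
    Q[(state, action)] += alpha * err_taken
    trace[(state, action)] += 1.0


def run_chain_episode(lam):
    """Hypothetical chain 0 -> 1 -> 2 (terminal); reward 1 on the last step."""
    Q = defaultdict(float)
    trace = defaultdict(float)
    actions = [0]
    q_lambda_step(Q, trace, 0, 0, 0.0, 1, actions, lam=lam)
    q_lambda_step(Q, trace, 1, 0, 1.0, 2, actions, lam=lam)
    return Q


if __name__ == "__main__":
    for lam in (0.0, 0.8):
        Q = run_chain_episode(lam)
        print(lam, Q[(0, 0)], Q[(1, 0)])
```

In this toy episode, with λ = 0 only the final state-action pair is updated after the reward arrives, whereas with λ > 0 the earlier pair also receives a share of the credit in the same step. This is the sense in which λ distributes credit throughout a sequence of actions.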

Jing Peng | Ronald J. Williams
