Fast and Efficient Reinforcement Learning with Truncated Temporal Differences

Abstract

The problem of temporal credit assignment in reinforcement learning is typically solved using algorithms based on the method of temporal differences, TD(λ). Of these, Q-learning is currently the best understood and most widely used. Using TD-based algorithms with λ > 0 often speeds up the propagation of credit significantly, but it raises certain implementation problems. The traditional implementation of TD(λ > 0) based on eligibility traces suffers from a lack of generality and from computational inefficiency. The TTD (Truncated Temporal Differences) procedure is a simple TD(λ) approximation technique that appears to overcome these drawbacks of eligibility traces. The paper outlines this technique, discusses its computational efficiency advantages, and presents experimental studies of the combination of TTD and Q-learning in deterministic and stochastic environments. These experiments show that TTD makes it possible to obtain a significant learning speedup without reducing reliability, at essentially the same computational cost as conventional TD(0) learning. We conclude that the TTD procedure is probably the most promising way of applying TD methods to reinforcement learning, especially for tasks with large state spaces and a hard temporal credit assignment problem.
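To make the idea referred to above concrete, the following is a minimal illustrative sketch (not taken from the paper) of a TTD-style Q-learning backup: the agent keeps a sliding window of the last m transitions and, at each step, updates only the oldest state-action pair using a truncated TD(λ) return computed over that window. The tabular dictionary Q, the function name ttd_q_update, the default parameter values, and the choice to recompute the truncated return from scratch each call (rather than incrementally, as an efficient implementation would) are all assumptions made for brevity.

    from collections import deque

    def ttd_q_update(Q, buffer, actions, gamma=0.95, lam=0.8, alpha=0.1):
        """One TTD-style backup: update the oldest transition in `buffer`
        (a sequence of (s, a, r, s_next) tuples) using a truncated
        TD(lambda) return computed over the whole buffer."""
        z = None
        # Walk backward so z accumulates the recursive truncated return:
        #   z_k = r_k + gamma * ((1 - lam) * max_b Q(s_{k+1}, b) + lam * z_{k+1})
        for (s, a, r, s_next) in reversed(buffer):
            best_next = max(Q.get((s_next, b), 0.0) for b in actions)
            if z is None:
                # Newest transition in the window: plain one-step return.
                z = r + gamma * best_next
            else:
                # Mix the bootstrapped estimate with the deeper return.
                z = r + gamma * ((1 - lam) * best_next + lam * z)
        # Update only the oldest state-action pair toward the truncated return.
        s0, a0, _, _ = buffer[0]
        old = Q.get((s0, a0), 0.0)
        Q[(s0, a0)] = old + alpha * (z - old)

    # Hypothetical usage: keep the last m transitions in a deque and call
    # ttd_q_update once the window is full.
    #   m = 5
    #   buffer = deque(maxlen=m)
    #   buffer.append((s, a, r, s_next))
    #   if len(buffer) == m:
    #       ttd_q_update(Q, buffer, actions)

Under these assumptions the per-step cost is proportional to the window length m rather than to the number of states, which is the kind of saving over eligibility traces that the abstract alludes to.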