Reinforcement learning with replacing eligibility traces

The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds of trace assign credit to prior events according to how recently they occurred, but only the conventional trace gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods converge under repeated presentations of the training set to the same predictions as two well-known Monte Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related to the maximum-likelihood solution for these tasks, and that its mean squared error is always lower in the long run. Computational results confirm these analyses and show that they are applicable more generally. In particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using a feature-based function approximator.
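The contrast between the two traces can be made concrete with a small sketch. The snippet below is a minimal illustration of tabular TD(λ) on a simple symmetric random walk; the toy task, the parameter values, and all variable names are assumptions for illustration, not the paper's experimental setup. The only difference between the conventional (accumulating) variant and the replacing variant is the single line that updates the trace of the currently visited state.

```python
import numpy as np

def td_lambda(n_states, episodes, trace="replacing",
              alpha=0.1, gamma=1.0, lam=1.0, seed=0):
    """Tabular TD(lambda) on a symmetric random walk (illustrative task).

    States 0 and n_states-1 are absorbing; a reward of +1 is given only
    on reaching the right terminal state. `trace` selects between the
    conventional accumulating trace and the replacing trace.
    """
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                  # state-value estimates
    for _ in range(episodes):
        e = np.zeros(n_states)              # eligibility traces
        s = n_states // 2                   # start in the middle
        while s not in (0, n_states - 1):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == n_states - 1 else 0.0
            v_next = 0.0 if s_next in (0, n_states - 1) else V[s_next]
            delta = r + gamma * v_next - V[s]   # TD error

            e *= gamma * lam                    # decay all traces
            if trace == "accumulating":
                e[s] += 1.0                     # conventional: revisits add up
            else:
                e[s] = 1.0                      # replacing: reset to 1 on revisit

            V += alpha * delta * e              # credit all eligible states
            s = s_next
    return V
```

Under repeated visits to a state the accumulating trace can grow above 1, whereas the replacing trace is capped at 1; this one-line difference is what gives repeated events extra credit under the conventional trace and underlies the bias contrast analyzed in the paper.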
