TD(λ) Converges with Probability 1

The methods of temporal differences (Samuel, 1959; Sutton, 1984, 1988) allow an agent to learn accurate predictions of stationary stochastic future outcomes. The learning is effectively stochastic approximation based on samples extracted from the process generating the agent's future.Sutton (1988) proved that for a special case of temporal differences, the expected values of the predictions converge to their correct values, as larger samples are taken, and Dayan (1992) extended his proof to the general case. This article proves the stronger result than the predictions of a slightly modified form of temporal difference learning converge with probability one, and shows how to quantify the rate of convergence.

[1]  Arthur L. Samuel,et al.  Some Studies in Machine Learning Using the Game of Checkers , 1967, IBM J. Res. Dev..

[2]  Tamio Shimizu,et al.  A Stochastic Approximation Method for Optimization Problems , 1969, Journal of the ACM.

[3]  Harold J. Kushner,et al.  wchastic. approximation methods for constrained and unconstrained systems , 1978 .

[4]  V. Nollau Kushner, H. J./Clark, D. S., Stochastic Approximation Methods for Constrained and Unconstrained Systems. (Applied Mathematical Sciences 26). Berlin‐Heidelberg‐New York, Springer‐Verlag 1978. X, 261 S., 4 Abb., DM 26,40. US $ 13.20 , 1980 .

[5]  Richard S. Sutton,et al.  Temporal credit assignment in reinforcement learning , 1984 .

[6]  Harold J. Kushner,et al.  Approximation and Weak Convergence Methods for Random Processes , 1984 .

[7]  Bruce E. Hajek,et al.  Review of 'Approximation and Weak Convergence Methods for Random Processes, with Applications to Stochastic Systems Theory' (Kushner, H.J.; 1984) , 1985, IEEE Transactions on Information Theory.

[8]  C. Watkins Learning from delayed rewards , 1989 .

[9]  Halbert White,et al.  Recursive M-estimation, nonlinear regression and neural network learning with dependent observations , 1990 .

[10]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[11]  Gerald Tesauro,et al.  Practical Issues in Temporal Difference Learning , 1992, Mach. Learn..

[12]  G. Tesauro Practical Issues in Temporal Difference Learning , 1992 .

[13]  Elie Bienenstock,et al.  Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[14]  Peter Dayan,et al.  The convergence of TD(λ) for general λ , 1992, Machine Learning.

[15]  Peter Dayan,et al.  Technical Note: Q-Learning , 2004, Machine Learning.

[16]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.