Learning to Predict by the Methods of Temporal Differences

This article introduces a class of incremental learning procedures specialized for prediction – that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods, and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.
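The core idea – updating each prediction toward the next prediction rather than waiting for the final outcome – can be illustrated with a minimal sketch of tabular TD(0) on a small bounded random walk. The particular setup below (five nonterminal states, outcomes 0 and 1 at the two ends, constant step size) is an illustrative assumption, not a reproduction of the article's experiments:

```python
import random

def td0_random_walk(episodes=5000, alpha=0.05, seed=0):
    """Tabular TD(0) prediction on a 5-state bounded random walk.

    States 1..5 are nonterminal; stepping left from state 1 terminates
    with outcome 0, stepping right from state 5 terminates with outcome 1.
    The true probability of terminating on the right from state i is i/6,
    so that is the target prediction for each state.
    """
    rng = random.Random(seed)
    V = [0.5] * 7            # predictions; indices 0 and 6 are terminal
    V[0], V[6] = 0.0, 1.0    # terminal "predictions" equal the outcomes
    for _ in range(episodes):
        s = 3                # every walk starts in the middle state
        while 0 < s < 6:
            s2 = s + rng.choice((-1, 1))
            # Temporal-difference update: credit is assigned from the
            # difference between successive predictions V[s2] - V[s],
            # not from the difference to the eventual outcome.
            V[s] += alpha * (V[s2] - V[s])
            s = s2
    return V[1:6]
```

Because each update uses only the current and successor predictions, the procedure needs no memory of the full sequence – which is the source of the memory and peak-computation advantage claimed above.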
