Reinforcement Learning: Past, Present and Future

Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to get the most reward. RL has become popular as an approach to artificial intelligence because of its simple algorithms and mathematical foundations (Watkins, 1989; Sutton, 1988; Bertsekas and Tsitsiklis, 1996) and because of a string of strikingly successful applications (e.g., Tesauro, 1995; Crites and Barto, 1996; Zhang and Dietterich, 1996; Nie and Haykin, 1996; Singh and Bertsekas, 1997; Baxter, Tridgell, and Weaver, 1998). An overall introduction to the field is provided by a recent textbook (Sutton and Barto, 1998). Here we summarize three stages in the development of the field, which we coarsely characterize as the past, present, and future of reinforcement learning.
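To make the trial-and-error setting concrete, below is a minimal sketch of tabular Q-learning in the spirit of Watkins (1989), run on a hypothetical toy "chain" environment. This sketch is not from the original abstract; the environment, hyperparameters, and all names are illustrative assumptions.

```python
import random

# A minimal, hypothetical "chain" environment used purely for illustration:
# states 0..5, actions 0 (left) and 1 (right); reward 1 only on reaching the
# rightmost state, which ends the episode.
N_STATES = 6
ACTIONS = (0, 1)

def step(state, action):
    """One environment transition: returns (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

alpha, gamma, epsilon = 0.1, 0.95, 0.1  # illustrative hyperparameters
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular action-value estimates

def greedy(state):
    """Pick an action with maximal estimated value, breaking ties randomly."""
    best = max(Q[state])
    return random.choice([a for a in ACTIONS if Q[state][a] == best])

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy exploration: the "trial" part of trial and error.
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward the bootstrapped one-step target.
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print("greedy policy:", [greedy(s) for s in range(N_STATES)])
```

After a few hundred episodes the greedy policy moves right toward the rewarding state even though the agent was never shown examples of correct behavior, which is precisely the sense in which an RL agent discovers how to behave from reward alone.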

[1] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

[2] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

[3] W. Zhang and T. G. Dietterich. High-performance job-shop scheduling with a time-delay TD(λ) network. In Advances in Neural Information Processing Systems, 1996.

[4] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems, 1996.

[5] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[7] S. Singh and D. P. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems, 1997.

[8] D. Precup and R. S. Sutton. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems, 1997.

[9] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales. Technical report, University of Massachusetts, Amherst, 1998.

[10] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[11] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), 1998.

[12] J. Nie and S. Haykin. A dynamic channel assignment policy through Q-learning. IEEE Transactions on Neural Networks, 10(6), 1999.