Reinforcement Learning: Past, Present and Future

Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to get the most reward. RL has become popular as an approach to artificial intelligence because of its simple algorithms and mathematical foundations (Watkins, 1989; Sutton, 1988; Bertsekas and Tsitsiklis, 1996) and because of a string of strikingly successful applications (e.g., Tesauro, 1995; Crites and Barto, 1996; Zhang and Dietterich, 1996; Nie and Haykin, 1996; Singh and Bertsekas, 1997; Baxter, Tridgell, and Weaver, 1998). An overall introduction to the field is provided by a recent textbook (Sutton and Barto, 1998). Here we summarize three stages in the development of the field, which we coarsely characterize as the past, present, and future of reinforcement learning.
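To make the trial-and-error setting concrete, below is a minimal sketch of tabular Q-learning in the spirit of Watkins (1989), run on a hypothetical toy "chain" environment. This sketch is not from the original abstract; the environment, hyperparameters, and all names are illustrative assumptions.

```python
import random

# A minimal, hypothetical "chain" environment used purely for illustration:
# states 0..5, actions 0 (left) and 1 (right); reward 1 only on reaching the
# rightmost state, which ends the episode.
N_STATES = 6
ACTIONS = (0, 1)

def step(state, action):
    """One environment transition: returns (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

alpha, gamma, epsilon = 0.1, 0.95, 0.1  # illustrative hyperparameters
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular action-value estimates

def greedy(state):
    """Pick an action with maximal estimated value, breaking ties randomly."""
    best = max(Q[state])
    return random.choice([a for a in ACTIONS if Q[state][a] == best])

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy exploration: the "trial" part of trial and error.
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(state)
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward the bootstrapped one-step target.
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

print("greedy policy:", [greedy(s) for s in range(N_STATES)])
```

After a few hundred episodes the greedy policy moves right toward the rewarding state even though the agent was never shown examples of correct behavior, which is precisely the sense in which an RL agent discovers how to behave from reward alone.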

[1] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

[2] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58-68, 1995.

[3] W. Zhang and T. G. Dietterich. High-performance job-shop scheduling with a time-delay TD(λ) network. In Advances in Neural Information Processing Systems, 1996.

[4] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In Advances in Neural Information Processing Systems, 1996.

[5] A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, 1996.

[6] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[7] S. Singh and D. P. Bertsekas. Reinforcement learning for dynamic channel allocation in cellular telephone systems. In Advances in Neural Information Processing Systems, 1997.

[8] D. Precup and R. S. Sutton. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems, 1997.

[9] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales. Technical report, University of Massachusetts, Amherst, 1998.

[10] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[11] J. Baxter, A. Tridgell, and L. Weaver. KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML), 1998.

[12] J. Nie and S. Haykin. A dynamic channel assignment policy through Q-learning. IEEE Transactions on Neural Networks, 10(6), 1999.