Paper information - Reducing policy degradation in neuro-dynamic programming

Reducing policy degradation in neuro-dynamic programming

We focus on neuro-dynamic programming methods to learn state-action value functions and outline some of the inherent problems to be faced when performing reinforcement learning in combination with function approximation. In an attempt to overcome some of these problems, we develop a reinforcement learning method that monitors the learning process, enables the learner to reflect on whether it is better to cease learning, and thus obtains more stable learning results.

Martin A. Riedmiller | Thomas Gabel
