Consider a given value function on the states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state, using the given value function to evaluate all succeeding states; it is natural to call this discrepancy the Bellman residual. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function can be, as a function of the maximum-norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approximator based on minimizing the Bellman residual across states or state-action pairs. When control is based on the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.
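For concreteness, here is a hedged sketch of the form such a bound takes in standard notation; the symbols below are notational assumptions rather than quotations from the abstract ($\gamma$ is the discount factor, $\epsilon$ bounds the Bellman residual in the maximum norm, $B$ is the Bellman optimality operator, and $\pi$ is the policy that is greedy with respect to $V$):

\[
  \|BV - V\|_\infty \le \epsilon
  \quad\Longrightarrow\quad
  V^{\pi}(s) \;\ge\; V^{*}(s) - \frac{2\gamma\epsilon}{1-\gamma}
  \quad\text{for every state } s .
\]

The intuition behind a bound of this shape: a per-state residual of $\epsilon$ lets both $V^{*}$ and $V^{\pi}$ drift from $V$ by at most $\epsilon/(1-\gamma)$ under the contraction argument, and the greedy one-step lookahead contributes the factor of $\gamma$; the exact constant and its tightness proof should be taken from the paper itself.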