Consider a given value function on the states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state, using the given value function to evaluate all succeeding states; it is natural to call this discrepancy the Bellman residual. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function can be, as a function of the maximum-norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as are used in Q-learning. One significant application of these results is to problems where a function approximator is used to learn a value function, with training of the approximator based on minimizing the Bellman residual across states or state-action pairs. When control is based on the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.
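For concreteness, here is a hedged sketch of the form such a bound takes in standard notation; the symbols below are notational assumptions rather than quotations from the abstract ($\gamma$ is the discount factor, $\epsilon$ bounds the Bellman residual in the maximum norm, $B$ is the Bellman optimality operator, and $\pi$ is the policy that is greedy with respect to $V$):

\[
  \|BV - V\|_\infty \le \epsilon
  \quad\Longrightarrow\quad
  V^{\pi}(s) \;\ge\; V^{*}(s) - \frac{2\gamma\epsilon}{1-\gamma}
  \quad\text{for every state } s .
\]

The intuition behind a bound of this shape: a per-state residual of $\epsilon$ lets both $V^{*}$ and $V^{\pi}$ drift from $V$ by at most $\epsilon/(1-\gamma)$ under the contraction argument, and the greedy one-step lookahead contributes the factor of $\gamma$; the exact constant and its tightness proof should be taken from the paper itself.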