Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions

Consider a given value function on the states of a Markov decision problem, as might result from applying a reinforcement learning algorithm. Unless this value function equals the corresponding optimal value function, at some states there will be a discrepancy, which it is natural to call the Bellman residual, between what the value function specifies at that state and what is obtained by a one-step lookahead along the seemingly best action at that state, using the given value function to evaluate all succeeding states. This paper derives a tight bound on how far from optimal the discounted return for a greedy policy based on the given value function can be, as a function of the maximum-norm magnitude of this Bellman residual. A corresponding result is also obtained for value functions defined on state-action pairs, as used in Q-learning. One significant application of these results is to problems in which a function approximator is used to learn a value function, with training of the approximator based on trying to minimize the Bellman residual across states or state-action pairs. When control is based on the resulting value function, this result provides a link between how well the objectives of function approximator training are met and the quality of the resulting control.
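
For the state value function case, the bound is commonly cited as 2γε/(1 − γ) in the maximum norm, where ε is the maximum-norm Bellman residual and γ the discount factor. The following is a minimal numerical sketch, not taken from the paper, that illustrates this relationship on a small randomly generated MDP; the MDP, its size, the perturbation used to produce an imperfect value function, and all variable names are assumptions chosen purely for illustration.

```python
# Sketch only: compares the greedy-policy loss to the bound 2*gamma*eps/(1-gamma)
# on an arbitrary random MDP. Not the paper's construction or proof.
import numpy as np

gamma = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Random MDP: P[a, s, s'] transition probabilities, R[s, a] expected rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def bellman_optimality(V):
    """One-step lookahead: (BV)(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1), Q.argmax(axis=1)

# Compute the optimal value function V* by value iteration.
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star, _ = bellman_optimality(V_star)

# An imperfect value function: a small arbitrary perturbation of V*.
V = V_star + rng.uniform(-0.05, 0.05, size=n_states)

# Maximum-norm Bellman residual of V, and the greedy policy based on V.
BV, greedy = bellman_optimality(V)
eps = np.abs(BV - V).max()

# Evaluate the greedy policy exactly: V_pi = (I - gamma * P_pi)^{-1} R_pi.
P_pi = P[greedy, np.arange(n_states), :]
R_pi = R[np.arange(n_states), greedy]
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

loss = np.abs(V_star - V_pi).max()
bound = 2 * gamma * eps / (1 - gamma)
print(f"Bellman residual eps           = {eps:.4f}")
print(f"max-norm loss of greedy policy = {loss:.4f}")
print(f"bound 2*gamma*eps/(1-gamma)    = {bound:.4f}")
assert loss <= bound + 1e-8
```

On a randomly generated problem like this one, the measured loss is typically far below the bound; tightness means the bound is approached only by specially constructed worst-case problems.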