Finite-sample Analysis of Bellman Residual Minimization

We consider the Bellman residual minimization approach for solving discounted Markov decision problems, assuming that a generative model of the dynamics and rewards is available. At each step of policy iteration, an approximation of the value function of the current policy is obtained by minimizing an empirical Bellman residual defined on a set of n states drawn i.i.d. from a distribution μ, the immediate rewards, and the next states sampled from the generative model. Our main result is a generalization bound for the Bellman residual in linear approximation spaces. In particular, we prove that the empirical Bellman residual approaches the true (quadratic) Bellman residual in μ-norm at a rate of order O(1/√n). This result implies that minimizing the empirical residual is a sound approach to minimizing the true Bellman residual, which in turn guarantees a good approximation of the value function for each policy. Finally, we derive performance bounds for the resulting approximate policy iteration algorithm in terms of the number of samples n and a measure of how well the function space approximates the sequence of value functions.
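To make the procedure concrete, here is a minimal sketch of one policy evaluation step of Bellman residual minimization in a linear space, under the generative-model setting described above. All names (phi, model, policy, gamma) are illustrative assumptions, not the paper's notation; the empirical residual is quadratic in the weight vector, so its minimizer reduces to a linear least-squares problem.

```python
import numpy as np

def brm_step(states, phi, model, policy, gamma, rng):
    """Fit a linear value function V(s) = phi(s) @ theta for `policy`
    by minimizing the empirical (quadratic) Bellman residual.

    states : n states drawn i.i.d. from the sampling distribution
    phi    : feature map, phi(s) -> array of shape (d,)
    model  : generative model, model(s, a, rng) -> (reward, next_state)
    policy : policy(s) -> action
    gamma  : discount factor in [0, 1)
    """
    n = len(states)
    d = len(phi(states[0]))
    A = np.zeros((n, d))   # rows: phi(s_i) - gamma * phi(s'_i)
    b = np.zeros(n)        # entries: sampled immediate rewards r_i

    for i, s in enumerate(states):
        a = policy(s)
        r, s_next = model(s, a, rng)      # one call to the generative model
        A[i] = phi(s) - gamma * phi(s_next)
        b[i] = r

    # Empirical Bellman residual: (1/n) * ||A @ theta - b||^2,
    # quadratic in theta, minimized by ordinary least squares.
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta
```

Note that this single-next-state version of the squared residual is a biased estimate in stochastic MDPs; with a generative model one can draw two independent next states per sampled state (double sampling) to obtain an unbiased estimate. The sketch keeps the single-sample form only for brevity.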
