Performance Bounds in Lp-norm for Approximate Value Iteration

Approximate value iteration (AVI) is a method for solving large Markov decision problems by approximating the optimal value function with a sequence of value-function representations $V_n$ computed according to the iteration $V_{n+1}=\mathcal{A}\mathcal{T} V_n$, where $\mathcal{T}$ is the so-called Bellman operator and $\mathcal{A}$ an approximation operator, which may be implemented by a supervised learning (SL) algorithm. The usual bounds on the asymptotic performance of AVI are stated in terms of the $L_\infty$-norm approximation errors induced by the SL algorithm. However, most widely used SL algorithms (such as least-squares regression) return a function (the best fit) that minimizes an empirical approximation error in $L_p$-norm ($p\geq 1$). In this paper, we extend the performance bounds of AVI to weighted $L_p$-norms, which enables us to relate the performance of AVI directly to the approximation power of the SL algorithm, hence ensuring the tightness and practical relevance of these bounds. The main result is a performance bound on the resulting policies expressed in terms of the $L_p$-norm errors introduced by the successive approximations. The new bound takes into account a concentration coefficient that measures how much the discounted future-state distributions, starting from the probability measure used to assess the performance of AVI, can differ from the distribution used in the regression step. We illustrate the tightness of the bounds on an optimal replacement problem.
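
To make the iteration $V_{n+1}=\mathcal{A}\mathcal{T} V_n$ concrete, the sketch below runs AVI on a small randomly generated finite MDP, with the approximation operator $\mathcal{A}$ played by an ordinary least-squares fit onto a linear feature space (the $L_2$ special case of the $L_p$ regression discussed above). The toy MDP, the features, and all names in the code are illustrative assumptions, not taken from the paper; the per-iteration residual $\|\mathcal{T}V_n - V_{n+1}\|$ printed at the end is the kind of quantity that the paper's bounds accumulate into a guarantee on the greedy policy.

```python
# Minimal sketch of approximate value iteration, V_{n+1} = A T V_n, on a toy
# finite MDP. The Bellman operator T is applied exactly; the approximation
# operator A is a least-squares (L2) projection onto a linear feature space,
# a simple stand-in for the Lp regression step discussed in the paper.
# The MDP, features, and all parameter values below are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: S states, A actions, random transition kernel P[a, s, s'] and rewards R[s, a].
S, A, gamma = 20, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S))   # P[a, s] is a distribution over next states
R = rng.uniform(0.0, 1.0, size=(S, A))       # immediate rewards

# Linear function approximation: Phi has one row of features per state.
k = 5
Phi = rng.normal(size=(S, k))

def bellman(V):
    """Bellman optimality operator: (T V)(s) = max_a [ R(s, a) + gamma * E[V(s')] ]."""
    Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s, a]
    return Q.max(axis=1)

def approximate(target):
    """Approximation operator A: least-squares fit of the target onto the feature space."""
    w, *_ = np.linalg.lstsq(Phi, target, rcond=None)
    return Phi @ w

# AVI loop: V_{n+1} = A T V_n.
V = np.zeros(S)
for n in range(200):
    TV = bellman(V)
    V_next = approximate(TV)
    # ||T V_n - V_{n+1}|| is the per-iteration approximation error whose
    # weighted Lp analogue appears in the paper's performance bounds.
    err = np.linalg.norm(TV - V_next) / np.sqrt(S)
    V = V_next

# Greedy policy with respect to the final approximation.
Q = R + gamma * np.einsum('ast,t->sa', P, V)
policy = Q.argmax(axis=1)
print("final per-iteration L2 error:", err)
print("greedy policy:", policy)
```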
