Finite time bounds for sampling based fitted value iteration

In this paper we consider sampling-based fitted value iteration for discounted Markovian Decision Problems with large (possibly infinite) state spaces and finite action sets, where only a generative model of the transition probabilities and rewards is available. At each step, the image of the current estimate of the optimal value function under a Monte-Carlo approximation of the Bellman operator is projected onto some function space. PAC-style bounds on the weighted L^p-norm approximation error are obtained as a function of the covering number and the approximation power of the function space, the number of iterations, and the sample size.
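To make the procedure concrete, here is a minimal Python sketch of one round-trip of the algorithm the abstract describes: at a fixed set of base states, a Monte-Carlo estimate of the Bellman backup is computed from the generative model, and the backed-up values are then projected onto a function space by regression. The generative model `sample(state, action)`, the feature map `featurize`, and the choice of ridge regression as the projection are all illustrative assumptions, not the paper's specification.

```python
# A minimal sketch of sampling-based fitted value iteration. The generative
# model `sample(s, a) -> (reward, next_state)` is assumed; the projection onto
# the function space is stood in for by ridge regression on a feature map.
import numpy as np
from sklearn.linear_model import Ridge  # one possible choice of function space

def fitted_value_iteration(sample, states, actions, gamma=0.9, n_iters=50,
                           n_mc=20, featurize=lambda s: np.atleast_1d(s)):
    """Run n_iters steps of fitted value iteration over a fixed set of states.

    sample(s, a) -> (reward, next_state): the generative model (assumed).
    states: base points at which Bellman backups are computed.
    actions: the finite action set.
    n_mc: Monte-Carlo sample size per (state, action) pair.
    """
    X = np.array([featurize(s) for s in states])
    V = lambda s: 0.0  # initial value-function estimate
    for _ in range(n_iters):
        targets = []
        for s in states:
            # Monte-Carlo approximation of the Bellman operator at s:
            # max over actions of the empirical mean of r + gamma * V(s').
            q = []
            for a in actions:
                draws = [sample(s, a) for _ in range(n_mc)]
                q.append(np.mean([r + gamma * V(s2) for r, s2 in draws]))
            targets.append(max(q))
        # Projection step: fit the function space to the backed-up values.
        reg = Ridge(alpha=1e-3).fit(X, np.array(targets))
        V = lambda s, reg=reg: float(reg.predict(featurize(s).reshape(1, -1))[0])
    return V
```

Given a simulator for, say, a one-dimensional problem, calling `fitted_value_iteration(sample, states=np.linspace(0, 1, 100), actions=[0, 1])` would return a callable estimate of the optimal value function; the paper's bounds govern how the error of such an estimate scales with `n_mc`, `n_iters`, and the capacity of the regression class.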
