Learning Near-Optimal Policies with Bellman-Residual Minimization Based Fitted Policy Iteration and a Single Sample Path

We consider batch reinforcement learning in continuous-space, expected total discounted-reward Markovian Decision Problems. In contrast to previous theoretical work, we consider the case where the training data consist of a single sample path (trajectory) of some behaviour policy. In particular, we do not assume access to a generative model of the environment. The algorithm studied is policy iteration in which, at each iteration, the Q-function of the current intermediate policy is obtained by minimizing a novel Bellman-residual type error. We derive PAC-style polynomial bounds on the number of samples needed to guarantee near-optimal performance; the bounds depend on the mixing rate of the trajectory, the smoothness properties of the underlying Markovian Decision Problem, and the approximation power and capacity of the function set used.
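The overall loop described above (evaluate the current policy by a Bellman-residual fit on one trajectory, then act greedily) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the five-state deterministic chain, the one-hot features, the uniformly random behaviour policy, and all function names. In particular, it minimizes the plain squared empirical Bellman residual, not the novel loss studied in the paper; the plain residual is known to be biased when transitions are stochastic, which is why the toy environment here is deterministic.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9  # hypothetical toy chain, not from the paper


def step(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right.
    Reward 1 whenever the next state is the last state."""
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, float(s_next == N_STATES - 1)


def collect_trajectory(n_steps, rng):
    """One sample path of a uniformly random behaviour policy (no resets)."""
    s, path = 0, []
    for _ in range(n_steps):
        a = int(rng.integers(N_ACTIONS))
        s_next, r = step(s, a)
        path.append((s, a, r, s_next))
        s = s_next
    return path


def features(s, a):
    """One-hot feature vector for the state-action pair (tabular case)."""
    phi = np.zeros(N_STATES * N_ACTIONS)
    phi[s * N_ACTIONS + a] = 1.0
    return phi


def evaluate_policy_brm(path, policy):
    """Least-squares fit of w minimizing
    sum_t (Q_w(s_t, a_t) - r_t - gamma * Q_w(s_{t+1}, policy(s_{t+1})))^2."""
    A = np.array([features(s, a) - GAMMA * features(s2, policy[s2])
                  for s, a, _, s2 in path])
    r = np.array([reward for _, _, reward, _ in path])
    w, *_ = np.linalg.lstsq(A, r, rcond=None)
    return w.reshape(N_STATES, N_ACTIONS)


def fitted_policy_iteration(path, n_iters=10):
    """Alternate residual-based evaluation and greedy improvement."""
    policy = np.zeros(N_STATES, dtype=int)  # start with "always left"
    for _ in range(n_iters):
        q = evaluate_policy_brm(path, policy)  # Q estimate of current policy
        policy = q.argmax(axis=1)              # greedy improvement step
    return policy, q


rng = np.random.default_rng(0)
path = collect_trajectory(2000, rng)
policy, q = fitted_policy_iteration(path)
print("greedy policy:", policy)
print("Q estimates:\n", q.round(3))
```

Because the same trajectory is reused in every iteration, no further interaction with the environment is needed after data collection, which matches the batch setting of the abstract. The sample-complexity questions the paper addresses (mixing rate, capacity of the function set) do not arise in this tabular, deterministic toy case.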
