Finite-sample analysis of least-squares policy iteration

In this paper, we report a performance bound for the widely used least-squares policy iteration (LSPI) algorithm. We first consider the problem of policy evaluation in reinforcement learning, that is, learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning method, and provide a finite-sample analysis of this algorithm. To do so, we derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds in the case where the Markov chain possesses a stationary distribution and is β-mixing. Finally, we analyze how the error at each policy-evaluation step propagates through the iterations of a policy iteration method, and derive a performance bound for the LSPI algorithm.
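To make the two building blocks concrete, the sketch below shows, under illustrative assumptions, an LSTD-Q evaluation of a fixed policy from a single sample path (solving a linear system A w = b) and the LSPI loop that alternates this evaluation with greedy policy improvement. The feature map `phi`, the ridge term `reg`, the finite action set, and all function names are hypothetical choices made for the sketch, not the paper's exact construction.

```python
import numpy as np


def lstdq(samples, phi, policy, gamma=0.99, reg=1e-6):
    # samples: list of (s, a, r, s_next) transitions from a single sample path
    # phi: feature map phi(s, a) -> np.ndarray of shape (d,)
    # policy: callable mapping a state to the action of the policy being evaluated
    # Solves (A + reg*I) w = b with
    #   A = sum_t phi(s_t, a_t) (phi(s_t, a_t) - gamma * phi(s'_t, policy(s'_t)))^T
    #   b = sum_t r_t * phi(s_t, a_t)
    d = phi(samples[0][0], samples[0][1]).shape[0]
    A = reg * np.eye(d)  # small ridge term keeps the system well posed on short paths
    b = np.zeros(d)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)


def lspi(samples, phi, n_actions, gamma=0.99, n_iterations=20):
    # Alternate LSTD-Q policy evaluation with greedy policy improvement.
    d = phi(samples[0][0], samples[0][1]).shape[0]
    w = np.zeros(d)

    def greedy(s):
        # greedy action w.r.t. the current weights w (the closure sees updates to w)
        return int(np.argmax([phi(s, a) @ w for a in range(n_actions)]))

    for _ in range(n_iterations):
        w = lstdq(samples, phi, greedy, gamma)
    return w, greedy
```

In this reading, the finite-sample bounds control the quality of the weights returned by each evaluation step, and the outer loop is where the per-iteration error propagates, as analyzed in the paper.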
