On the Sample Complexity of Reinforcement Learning with a Generative Model

We consider the problems of learning the optimal action-value function and the optimal policy in discounted-reward Markov decision processes (MDPs). We prove new PAC bounds on the sample complexity of two well-known model-based reinforcement learning (RL) algorithms in the presence of a generative model of the MDP: value iteration and policy iteration. The first result shows that for an MDP with N state-action pairs and discount factor γ ∈ [0,1), only O(N log(N/δ)/((1−γ)³ε²)) state-transition samples are required to find an ε-optimal estimate of the action-value function with probability (w.p.) 1−δ. Further, we prove that, for small values of ε, an order of O(N log(N/δ)/((1−γ)³ε²)) samples suffices to find an ε-optimal policy w.p. 1−δ. We also prove a matching lower bound of Θ(N log(N/δ)/((1−γ)³ε²)) on the sample complexity of estimating the optimal action-value function to accuracy ε. To the best of our knowledge, this is the first minimax result on the sample complexity of RL: the upper bounds match the lower bound in terms of N, ε, δ, and 1/(1−γ) up to a constant factor. Both our lower and upper bounds improve on the state of the art in their dependence on 1/(1−γ).

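To make the model-based scheme concrete, below is a minimal Python sketch of value iteration with a generative model: query the generative model the same number of times for every state-action pair, form the empirical transition model, and run Bellman backups on that empirical MDP. The helper names (per_pair_sample_size, empirical_model, q_value_iteration), the unit constant c, and the toy two-state MDP are illustrative assumptions, not the paper's code; the paper's analysis fixes the constants and the exact number of samples.

```python
# A minimal sketch of model-based Q-value iteration with a generative model,
# in the spirit of the algorithm analyzed above. The toy MDP, helper names,
# and the constant c are illustrative assumptions, not the authors' code.

import math
import numpy as np


def per_pair_sample_size(num_pairs, gamma, eps, delta, c=1.0):
    """Next-state samples per (s, a), following the shape of the paper's
    O(N log(N/delta) / ((1 - gamma)^3 eps^2)) total sample complexity spread
    evenly over the N pairs. The constant c = 1 is an assumption here."""
    return int(math.ceil(c * math.log(num_pairs / delta)
                         / ((1.0 - gamma) ** 3 * eps ** 2)))


def empirical_model(sample_next_state, n_states, n_actions, n_samples, rng):
    """Call the generative model n_samples times per (s, a) and return the
    empirical transition tensor P_hat[s, a, s']."""
    p_hat = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(n_samples):
                p_hat[s, a, sample_next_state(s, a, rng)] += 1.0
    return p_hat / n_samples


def q_value_iteration(p_hat, rewards, gamma, n_iters):
    """Plain Q-value iteration on the empirical MDP (p_hat, rewards)."""
    q = np.zeros(rewards.shape)
    for _ in range(n_iters):
        v = q.max(axis=1)                 # greedy state values
        q = rewards + gamma * p_hat @ v   # Bellman optimality backup
    return q


if __name__ == "__main__":
    # Toy 2-state, 2-action MDP used purely to exercise the sketch.
    rng = np.random.default_rng(0)
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])   # true P[s, a, s']
    R = np.array([[1.0, 0.0],
                  [0.0, 1.0]])                 # rewards R[s, a]
    gamma, eps, delta = 0.9, 0.5, 0.05
    N = P.shape[0] * P.shape[1]                # number of state-action pairs

    def sample_next_state(s, a, rng):
        # The generative model: one draw from the true next-state law.
        return rng.choice(P.shape[2], p=P[s, a])

    n = per_pair_sample_size(N, gamma, eps, delta)
    p_hat = empirical_model(sample_next_state, P.shape[0], P.shape[1], n, rng)
    # Iterate until gamma^k / (1 - gamma) falls below eps, so the remaining
    # value-iteration error is dominated by the estimation error.
    k = int(math.ceil(math.log(1.0 / (eps * (1.0 - gamma))) / (1.0 - gamma)))
    q_hat = q_value_iteration(p_hat, R, gamma, k)
    print("samples per (s, a):", n)
    print("estimated Q*:\n", q_hat)
```

Note that the generative model is queried only once, up front; every Bellman backup afterwards reuses the same fixed empirical model. Estimating the action-value function of this empirical MDP to sufficient accuracy is exactly the algorithmic setup the upper bounds above describe.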