Finite Sample Analysis for TD(0) with Linear Function Approximation

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, until now there has been no finite sample analysis for TD(0) with function approximation, even in the linear case; our work is the first to provide such a result. Prior works that obtained concentration bounds for online temporal-difference (TD) methods analyzed modified versions of the algorithms, carefully crafted so that the analyses hold. These modifications include projections and step sizes that depend on unknown problem parameters. Our analysis obviates these artificial alterations by exploiting strong properties of TD(0) and by using tailor-made stochastic approximation tools.
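
To make the object of the analysis concrete, below is a minimal sketch of the unmodified TD(0) update with linear value approximation, V(s) ≈ θᵀφ(s), using a parameter-free diminishing step size and no projection of the iterates. The transition sampler, feature map, and the toy random-walk usage are illustrative assumptions, not part of the paper.

```python
import numpy as np

def td0_linear(sample_transition, phi, dim, gamma=0.95, num_steps=20_000):
    """Vanilla TD(0) with linear value approximation V(s) ~= theta @ phi(s).

    sample_transition() -> (s, r, s_next) draws one transition under the
    evaluated policy (a hypothetical interface); phi(s) returns a feature
    vector of length `dim`. The step size alpha_t = 1/(t+1) does not depend
    on problem parameters, and theta is never projected, matching the
    unmodified algorithm discussed above.
    """
    theta = np.zeros(dim)
    for t in range(num_steps):
        s, r, s_next = sample_transition()
        td_error = r + gamma * theta @ phi(s_next) - theta @ phi(s)
        theta += (1.0 / (t + 1)) * td_error * phi(s)
    return theta

# Toy usage: a 5-state random walk with reward 1 for entering the last state.
# States are sampled i.i.d. uniformly (an i.i.d. observation model, for
# simplicity) and one-hot features make the linear architecture exact.
rng = np.random.default_rng(0)
n = 5

def sample_transition():
    s = rng.integers(n)
    s_next = min(max(s + rng.choice([-1, 1]), 0), n - 1)
    return s, float(s_next == n - 1), s_next

theta_hat = td0_linear(sample_transition, phi=lambda s: np.eye(n)[s], dim=n)
print(theta_hat)  # approximate values of the random-walk states
```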
