Finite Sample Analyses for TD(0) With Function Approximation

TD(0) is one of the most commonly used algorithms in reinforcement learning. Despite this, no finite sample analysis exists for TD(0) with function approximation, even in the linear case. Our work is the first to provide such results. Existing convergence rates for Temporal Difference (TD) methods apply only to somewhat modified versions, e.g., projected variants or ones where the step sizes depend on unknown problem parameters. Our analyses obviate these artificial alterations by exploiting strong properties of TD(0). We provide convergence rates both in expectation and with high probability. The two are obtained via different approaches that rely on relatively unknown, recently developed stochastic approximation techniques.
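
For context, the sketch below shows vanilla TD(0) with linear function approximation, the unmodified form of the algorithm (no projection, no problem-dependent step sizes) that the abstract refers to. It is a minimal illustration, not the paper's analysis: the environment interface, the feature map `phi`, and the 1/t step-size schedule are assumptions made for the example.

```python
import numpy as np

def td0_linear(env, phi, num_features, gamma=0.99, num_steps=10_000):
    """Evaluate a fixed policy with TD(0) and a linear value estimate V(s) = phi(s) @ theta.

    `env` is assumed to expose reset(), sample_action(state), and
    step(action) -> (next_state, reward, done); `phi(state)` is assumed to
    return a feature vector of length `num_features`.
    """
    theta = np.zeros(num_features)
    state = env.reset()
    for t in range(1, num_steps + 1):
        action = env.sample_action(state)            # action from the policy being evaluated
        next_state, reward, done = env.step(action)

        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        v_state = phi(state) @ theta
        v_next = 0.0 if done else phi(next_state) @ theta
        delta = reward + gamma * v_next - v_state

        # TD(0) update: theta <- theta + alpha_t * delta_t * phi(s_t)
        alpha = 1.0 / t                              # diminishing step size (illustrative choice)
        theta += alpha * delta * phi(state)

        state = env.reset() if done else next_state
    return theta
```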
