Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for a two-timescale SA. The type of bound we obtain is known as `lock-in probability'. We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.

[1]  Csaba Szepesvári,et al.  Linear Stochastic Approximation: How Far Does Constant Step-Size and Iterate Averaging Go? , 2018, AISTATS.

[2]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[3]  Shalabh Bhatnagar,et al.  Natural actor-critic algorithms , 2009, Autom..

[4]  H. Kushner A projected stochastic approximation method for adaptive filters and identifiers , 1980 .

[5]  Francis R. Bach,et al.  Constant Step Size Least-Mean-Square: Bias-Variance Trade-offs and Optimal Sampling Distributions , 2014, AISTATS 2014.

[6]  J. Zico Kolter,et al.  The Fixed Points of Off-Policy TD , 2011, NIPS.

[7]  V. Lakshmikantham,et al.  Method of Variation of Parameters for Dynamic Systems , 1998 .

[8]  Marek Petrik,et al.  Finite-Sample Analysis of Proximal Gradient TD Algorithms , 2015, UAI.

[9]  L. Gerencsér Rate of convergence of moments of Spall's SPSA method , 1997, 1997 European Control Conference (ECC).

[10]  Shalabh Bhatnagar,et al.  Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation , 2009, NIPS.

[11]  John N. Tsitsiklis,et al.  Analysis of Temporal-Diffference Learning with Function Approximation , 1996, NIPS.

[12]  A. Mokkadem,et al.  Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms , 2006, math/0610329.

[13]  J. Tsitsiklis,et al.  Convergence rate of linear two-time-scale stochastic approximation , 2004, math/0405287.

[14]  Richard S. Sutton,et al.  A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation , 2008, NIPS.

[15]  Shie Mannor,et al.  Finite Sample Analyses for TD(0) With Function Approximation , 2017, AAAI.

[16]  Martha White,et al.  An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning , 2015, J. Mach. Learn. Res..

[17]  R. Sutton,et al.  Gradient temporal-difference learning algorithms , 2011 .

[18]  Sean P. Meyn,et al.  The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning , 2000, SIAM J. Control. Optim..

[19]  Harold J. Kushner,et al.  Stochastic Approximation Algorithms and Applications , 1997, Applications of Mathematics.

[20]  V. Borkar,et al.  A Concentration Bound for Stochastic Approximation via Alekseev’s Formula , 2015, Stochastic Systems.

[21]  Shalabh Bhatnagar,et al.  Fast gradient-descent methods for temporal-difference learning with linear function approximation , 2009, ICML '09.

[22]  Daniela Fischer Differential Equations Dynamical Systems And An Introduction To Chaos , 2016 .

[23]  V. Borkar Stochastic Approximation: A Dynamical Systems Viewpoint , 2008, Texts and Readings in Mathematics.

[24]  J. Spall Multivariate stochastic approximation using a simultaneous perturbation gradient approximation , 1992 .

[25]  Shalabh Bhatnagar,et al.  A stability criterion for two timescale stochastic approximation schemes , 2017, Autom..

[26]  Dimitri P. Bertsekas,et al.  Dynamic Programming and Optimal Control, Two Volume Set , 1995 .

[27]  Shalabh Bhatnagar,et al.  Toward Off-Policy Learning Control with Function Approximation , 2010, ICML.

[28]  Richard S. Sutton,et al.  Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[29]  Nathaniel Korda,et al.  On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence , 2014, ICML.

[30]  T. Sideris Ordinary Differential Equations and Dynamical Systems , 2013 .