Two-Timescale Stochastic Approximation Convergence Rates with Applications to Reinforcement Learning

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates consist of two components that are updated with distinct stepsizes, one asymptotically smaller than the other. In this work we provide a general recipe for analyzing two-timescale SA and use it to derive the first convergence rate result for this class of algorithms. From this result we extract key insights on stepsize selection. As an application, we obtain convergence rates for two-timescale RL algorithms such as GTD(0), GTD2, and TDC.
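
For readers unfamiliar with the setup, the generic two-timescale SA recursion can be sketched as follows (standard notation from the SA literature; the symbols x, y, h, g, a_n, b_n are illustrative and not necessarily those used in the paper):

\[
\begin{aligned}
x_{n+1} &= x_n + a_n \bigl[ h(x_n, y_n) + M^{(1)}_{n+1} \bigr], \\
y_{n+1} &= y_n + b_n \bigl[ g(x_n, y_n) + M^{(2)}_{n+1} \bigr],
\end{aligned}
\]

where $M^{(1)}_{n+1}$ and $M^{(2)}_{n+1}$ are martingale-difference noise terms and the stepsizes satisfy $\sum_n a_n = \sum_n b_n = \infty$, $\sum_n (a_n^2 + b_n^2) < \infty$, and $b_n / a_n \to 0$, so that $x_n$ evolves on the faster timescale and $y_n$ on the slower one. In GTD2 and TDC, for instance, the auxiliary weight vector plays the role of the fast iterate and the main parameter vector that of the slow iterate.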
