Concentration Bounds for Two Timescale Stochastic Approximation with Applications to Reinforcement Learning

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). In these methods, the iterate consists of two components that are updated with different stepsizes, one on a fast timescale and one on a slow timescale. We develop the first concentration bound, and hence convergence-rate result, for such algorithms; in particular, we provide a general methodology for analyzing two-timescale linear SA. We then apply this methodology to two-timescale RL algorithms, including GTD(0), GTD2, and TDC.
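For concreteness, the sketch below illustrates the two-timescale structure with a TDC-style update under linear function approximation: a slow iterate (the value-function weights) and a fast auxiliary iterate are updated with different stepsizes. This is a minimal illustration, not the paper's analysis or experiments; the NumPy setup, stepsize schedules, and the synthetic random-feature loop are assumptions made only for the example.

import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, gamma, alpha, beta):
    # One TDC-style two-timescale update with linear features.
    # theta: slow-timescale weights of the value-function estimate.
    # w:     fast-timescale auxiliary weights (gradient-correction term).
    # alpha, beta: stepsizes; the two-timescale condition requires alpha_t / beta_t -> 0.
    delta = reward + gamma * (phi_next @ theta) - phi @ theta   # TD error
    theta = theta + alpha * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

# Illustrative driver on synthetic random features (assumed setup, not a real MDP).
rng = np.random.default_rng(0)
d = 4
theta, w = np.zeros(d), np.zeros(d)
for t in range(1, 1001):
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    reward = rng.normal()
    # Typical schedules: the slow stepsize 1/t decays faster than the fast stepsize 1/t**(2/3).
    theta, w = tdc_update(theta, w, phi, phi_next, reward,
                          gamma=0.9, alpha=1.0 / t, beta=1.0 / t ** (2 / 3))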
