Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning

Two-timescale Stochastic Approximation (SA) algorithms are widely used in Reinforcement Learning (RL). Their iterates have two parts that are updated using distinct stepsizes. In this work, we develop a novel recipe for their finite sample analysis. Using this, we provide a concentration bound, which is the first such result for two-timescale SA. The type of bound we obtain is known as a "lock-in probability". We also introduce a new projection scheme, in which the time between successive projections increases exponentially. This scheme allows one to elegantly transform a lock-in probability into a convergence rate result for projected two-timescale SA. From this latter result, we then extract key insights on stepsize selection. As an application, we finally obtain convergence rates for the projected two-timescale RL algorithms GTD(0), GTD2, and TDC.
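
To make the two ingredients named in the abstract concrete (two coupled iterates with distinct stepsizes, and projections whose spacing grows exponentially), here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's algorithm: the scalar linear drift, the stepsize exponents 0.9 and 0.6, the Gaussian noise, the ball radius, and the choice to project at steps 1, 2, 4, 8, ... are all hypothetical choices made for this toy example.

```python
import math
import random


def project_to_ball(theta, w, radius):
    """Scale (theta, w) back onto the Euclidean ball of the given radius."""
    norm = math.hypot(theta, w)
    if norm <= radius:
        return theta, w
    scale = radius / norm
    return theta * scale, w * scale


def two_timescale_sa(num_steps=20000, radius=10.0, noise_std=0.1, seed=0):
    """Toy linear two-timescale SA with exponentially sparse projections.

    Assumed (hypothetical) mean dynamics, chosen only so that the coupled
    system is stable with unique fixed point theta = w = 1/3:
        drift_theta = 1 - 2*theta - w
        drift_w     = 1 - theta - 2*w
    """
    rng = random.Random(seed)
    theta, w = 5.0, -5.0          # arbitrary starting point
    next_projection = 1           # project at steps 1, 2, 4, 8, ...
    num_projections = 0

    for n in range(num_steps):
        # Distinct stepsizes: alpha decays faster than beta, so theta is
        # the slow iterate and w is the fast one (alpha / beta -> 0).
        alpha = (n + 1) ** -0.9
        beta = (n + 1) ** -0.6

        drift_theta = 1.0 - 2.0 * theta - w
        drift_w = 1.0 - theta - 2.0 * w

        # Both parts are updated from the same previous iterate,
        # each with its own stepsize and its own noise sample.
        theta, w = (
            theta + alpha * (drift_theta + rng.gauss(0.0, noise_std)),
            w + beta * (drift_w + rng.gauss(0.0, noise_std)),
        )

        # Sparse projection: the gap between projections doubles each time.
        if n + 1 == next_projection:
            theta, w = project_to_ball(theta, w, radius)
            num_projections += 1
            next_projection *= 2

    return theta, w, num_projections


if __name__ == "__main__":
    theta, w, k = two_timescale_sa()
    print(f"theta={theta:.4f}, w={w:.4f}, projections={k}")
```

With these assumed parameters both iterates should settle near the fixed point 1/3, while the number of projections grows only logarithmically in the number of steps. Changing the two exponents is a simple way to experiment with how the gap between the timescales affects the behaviour of this toy system.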

Shie Mannor | Balázs Szörényi | Gugan Thoppe | Gal Dalal
