Estimation and Approximation Bounds for Gradient-Based Reinforcement Learning

We model reinforcement learning as the problem of learning to control a partially observable Markov decision process (POMDP) and focus on gradient ascent approaches to this problem. In earlier work (2001, J. Artificial Intelligence Res. 14) we introduced GPOMDP, an algorithm for estimating the performance gradient of a POMDP from a single sample path, and we proved that this algorithm almost surely converges to an approximation of the gradient. In this paper, we provide a convergence rate for the estimates produced by GPOMDP and give an improved bound on the approximation error of these estimates. Both bounds are expressed in terms of mixing times of the POMDP.
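For readers unfamiliar with GPOMDP, the following is a minimal sketch of the single-sample-path estimator in the form given in the 2001 paper: a discounted eligibility trace of policy score functions, averaged against observed rewards. Everything specific here is an assumption made for illustration and does not come from this paper: the toy two-state environment, the one-parameter logistic policy, the function name `gpomdp_estimate`, and the values of `theta`, `beta`, and `num_steps`. In this toy problem the next state depends only on the current action, so the chain mixes in one step and the exact gradient of the average reward is 0.8 p (1 - p), where p is the probability of choosing action 1.

```python
import math
import random


def gpomdp_estimate(theta, beta, num_steps, seed=0):
    """Estimate d(average reward)/d(theta) from one sample path.

    Hypothetical toy setup (not from the paper):
      - actions are 0 or 1, chosen with P(action = 1) = sigmoid(theta);
      - the next state equals the action with probability 0.9,
        otherwise it is the other state;
      - the reward is 1 in state 1 and 0 in state 0.
    """
    rng = random.Random(seed)
    p_one = 1.0 / (1.0 + math.exp(-theta))

    z = 0.0      # eligibility trace, discounted by beta
    delta = 0.0  # running average of reward * trace

    for t in range(num_steps):
        action = 1 if rng.random() < p_one else 0

        # Score function d/d(theta) log mu(action | theta) for a logistic policy.
        score = action - p_one
        z = beta * z + score

        # Environment step: the reward is observed in the next state.
        state = action if rng.random() < 0.9 else 1 - action
        reward = float(state)

        # Incremental form of the time average of reward * z.
        delta += (reward * z - delta) / (t + 1)

    return delta


estimate = gpomdp_estimate(theta=0.0, beta=0.8, num_steps=100_000)
exact = 0.8 * 0.5 * 0.5  # exact gradient of this toy problem at theta = 0
print(estimate, exact)
```

The parameter `beta` in [0, 1) plays the role of the discount factor that trades bias against variance. In this toy chain the bias vanishes for any `beta` because the state is independent of all but the most recent action; in a slowly mixing POMDP a `beta` closer to 1 would be needed, which is the dependence on mixing time that the bounds in the abstract quantify.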

[1]  W. Hoeffding.  Probability Inequalities for Sums of Bounded Random Variables , 1963 .

[2]  Peter W. Glynn,et al.  Stochastic approximation for Monte Carlo optimization , 1986, WSC '86.

[3]  Alan Weiss,et al.  Sensitivity analysis via likelihood ratios , 1986, WSC '86.

[4]  R. J. Williams.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 1992, Machine Learning.

[5]  Dharmendra S. Modha,et al.  Minimum complexity regression estimation with weakly dependent observations , 1996, IEEE Trans. Inf. Theory.

[6]  Shigenobu Kobayashi,et al.  Reinforcement Learning in POMDPs with Function Approximation , 1997, ICML.

[7]  Xi-Ren Cao,et al.  Algorithms for sensitivity analysis of Markov systems through potentials and perturbation realization , 1998, IEEE Trans. Control. Syst. Technol.

[8]  P. Marbach.  Simulation-Based Methods for Markov Decision Processes , 1998 .

[9]  Andrew W. Moore,et al.  Gradient Descent for General Reinforcement Learning , 1998, NIPS.

[10]  John N. Tsitsiklis,et al.  Simulation-based optimization of Markov reward processes , 1998, Proceedings of the 37th IEEE Conference on Decision and Control (Cat. No.98CH36171).

[11]  P. Kumar,et al.  Learning dynamical systems in a stationary environment , 1998 .

[12]  John N. Tsitsiklis,et al.  Actor-Critic Algorithms , 1999, NIPS.

[13]  Peter L. Bartlett,et al.  Neural Network Learning - Theoretical Foundations , 1999 .

[14]  Yishay Mansour,et al.  Policy Gradient Methods for Reinforcement Learning with Function Approximation , 1999, NIPS.

[15]  P. Bartlett,et al.  Direct Gradient-Based Reinforcement Learning: I. Gradient Estimation Algorithms , 1999 .

[16]  Peter L. Bartlett,et al.  Infinite-Horizon Policy-Gradient Estimation , 2001, J. Artif. Intell. Res..

[17]  Ron Meir,et al.  Nonparametric Time Series Prediction Through Adaptive Model Selection , 2000, Machine Learning.