Geometric Variance Reduction in Markov Chains: Application to Value Function and Gradient Estimation

We study a sequential variance reduction technique for Monte Carlo estimation of functionals in Markov chains. The method is based on designing sequential control variates using successive approximations of the function of interest V. Standard Monte Carlo estimates have a variance of O(1/N), where N is the number of samples. Here, we obtain a geometric variance reduction O(ρ^N) (with ρ < 1) down to a threshold that depends on the approximation error V - AV, where A is an approximation operator linear in the values. Thus, if V belongs to the right approximation space (i.e. AV = V), the variance decreases geometrically to zero. An immediate application is value function estimation in Markov chains, which may be used for policy evaluation in a policy iteration scheme for Markov Decision Processes. Another important domain, for which variance reduction is highly needed, is gradient estimation, that is, computing the sensitivity ∂_α V of the performance measure V with respect to some parameter α of the transition probabilities. For example, in parametric optimization of the policy, an estimate of the policy gradient is required to perform a gradient-based optimization method. We show that, using two approximations, one of the value function and one of the gradient, a geometric variance reduction is also achieved, up to a threshold that depends on the approximation errors of both representations.
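A minimal sketch of the sequential control variate idea described above, not the paper's exact algorithm: the chain (a small reflecting random walk), the discount factor, the polynomial features, and all function names below are illustrative assumptions. The current approximation V_n serves as a control variate, so the Monte Carlo summand is the Bellman residual of V_n along the trajectory; its variance shrinks as V_n approaches V, and refitting V_{n+1} through a linear least-squares operator A closes the loop.

```python
# Sketch of sequential control variates for value estimation in a
# discounted Markov chain (illustrative assumptions, see lead-in above).
import numpy as np

rng = np.random.default_rng(0)

n_states = 11          # states 0..10
gamma = 0.95

def reward(x):         # reward collected in the current state
    return 1.0 if x == n_states - 1 else 0.0

def step(x):           # symmetric random walk, reflecting at both ends
    move = rng.choice([-1, 1])
    return min(max(x + move, 0), n_states - 1)

def features(x):       # polynomial features spanning the space A projects onto
    s = x / (n_states - 1)
    return np.array([1.0, s, s**2, s**3])

def correction_estimate(x0, theta, n_traj=4, horizon=60):
    """Monte Carlo estimate of V(x0) - V_n(x0), where V_n = theta . phi.
    The summand gamma^t * (r + gamma*V_n(x') - V_n(x)) telescopes to the
    correction, and its variance is driven by the Bellman residual of V_n."""
    def v(x):
        return features(x) @ theta
    total = 0.0
    for _ in range(n_traj):
        x, acc = x0, 0.0
        for t in range(horizon):
            x_next = step(x)
            acc += gamma**t * (reward(x) + gamma * v(x_next) - v(x))
            x = x_next
        total += acc
    return total / n_traj

# Sequential scheme: estimate the correction with V_n as control variate,
# then project V_n + correction back onto the feature space (the operator A).
theta = np.zeros(4)
sample_states = np.arange(n_states)
Phi = np.stack([features(x) for x in sample_states])
for n in range(10):
    targets = np.array([features(x) @ theta + correction_estimate(x, theta)
                        for x in sample_states])
    theta, *_ = np.linalg.lstsq(Phi, targets, rcond=None)

print("approximate values:", np.round(Phi @ theta, 3))
```

In this sketch the per-iteration Monte Carlo noise decays as V_n improves, which is the mechanism behind the geometric variance reduction; the residual floor corresponds to the approximation error V - AV of the chosen feature space.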
