Approximate Gradient Methods in Policy-Space Optimization of Markov Reward Processes

We consider a discrete-time, finite-state Markov reward process that depends on a set of parameters. We start with a brief review of (stochastic) gradient descent methods that tune the parameters to optimize the average reward, using a single (possibly simulated) sample path of the process of interest. The resulting algorithms can be implemented online and have the property that the gradient of the average reward converges to zero with probability 1. However, the updates can have high variance, resulting in slow convergence. We address this issue and propose two approaches to reduce the variance. These approaches rely on approximate gradient formulas, which introduce an additional bias into the update direction. We derive bounds for the resulting bias terms and characterize the asymptotic behavior of the resulting algorithms. For one of the approaches considered, the magnitude of the bias term exhibits an interesting dependence on the time it takes for the rewards to reach steady-state. We also apply the methodology to Markov reward processes with a reward-free termination state and an expected total reward criterion. We use a call admission control problem to illustrate the performance of the proposed algorithms.
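
As a rough illustration of the kind of online, single-sample-path gradient method described above, the sketch below runs a likelihood-ratio (score-function) update with a discounted eligibility trace on a toy parameterized Markov reward process. It is a minimal sketch, not the paper's exact algorithm: the two-state chain, the logistic parameterization, the reward values, and all step sizes are illustrative assumptions, and the trace discount `beta` stands in for the bias-introducing approximation discussed in the abstract.

```python
import numpy as np

# Minimal sketch: online likelihood-ratio gradient ascent on the average
# reward of a parameterized Markov reward process, driven by a single
# simulated sample path.  All model details below are illustrative
# assumptions, not the model or the exact algorithm of the paper.

rng = np.random.default_rng(0)

def transition_probs(theta, state):
    """P(next state = 1 | state; theta) via a logistic parameterization."""
    p1 = 1.0 / (1.0 + np.exp(-theta[state]))
    return np.array([1.0 - p1, p1])

def grad_log_prob(theta, state, next_state):
    """Gradient of log P(next_state | state; theta) with respect to theta."""
    p1 = 1.0 / (1.0 + np.exp(-theta[state]))
    g = np.zeros_like(theta)
    g[state] = (1.0 if next_state == 1 else 0.0) - p1
    return g

rewards = np.array([1.0, 3.0])   # per-state rewards (assumed)
theta = np.zeros(2)              # tunable parameters, one per state
beta = 0.95                      # eligibility-trace discount (bias/variance knob)
alpha = 0.01                     # step size for the parameter update
lam_hat = 0.0                    # running estimate of the average reward

state = 0
z = np.zeros_like(theta)         # eligibility trace of score functions

for t in range(200_000):
    probs = transition_probs(theta, state)
    next_state = rng.choice(2, p=probs)
    r = rewards[next_state]

    # Discounted accumulation of likelihood-ratio (score) terms.
    z = beta * z + grad_log_prob(theta, state, next_state)

    # Centered reward times the trace gives a (biased) gradient estimate.
    lam_hat += 0.001 * (r - lam_hat)
    g = (r - lam_hat) * z

    # Online stochastic gradient ascent on the average reward.
    theta += alpha * g
    state = next_state

print("estimated average reward:", lam_hat)
print("parameters:", theta)
```

In this sketch, pushing `beta` toward 1 reduces the bias of the gradient estimate but increases its variance, mirroring the bias-variance trade-off and the dependence on the process's mixing behavior highlighted in the abstract.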
