Simulation-based optimization of Markov reward processes

This paper proposes a simulation-based algorithm for optimizing the average reward in a finite-state Markov reward process that depends on a set of parameters. As a special case, the method applies to Markov decision processes where optimization takes place within a parametrized set of policies. The algorithm relies on the regenerative structure of finite-state Markov processes, involves the simulation of a single sample path, and can be implemented online. A convergence result (with probability 1) is provided.
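To make the role of the regenerative structure concrete, the following is a minimal sketch and not the paper's algorithm. It only estimates the average reward of a parametrized chain from a single simulated path, split into cycles that start at each visit to a recurrent state. The two-state chain, the sigmoid parametrization of the transition probability, the reward values, and the function names are all assumptions made for illustration. The estimate is the ratio of total reward to total steps over the simulated cycles, which by the renewal-reward theorem converges to the average reward.

```python
import math
import random


def exact_average_reward(theta):
    """Closed-form average reward of the illustrative two-state chain."""
    p = 1.0 / (1.0 + math.exp(-theta))  # P(0 -> 1), assumed parametrization
    q = 0.5                             # P(1 -> 0), assumed fixed
    # Stationary probability of state 1 is p / (p + q); the reward is 1 there
    # and 0 in state 0, so this is also the average reward.
    return p / (p + q)


def simulate_average_reward(theta, num_cycles=20000, seed=0):
    """Estimate the average reward from one sample path, cycle by cycle."""
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))  # P(0 -> 1), assumed parametrization
    q = 0.5                             # P(1 -> 0), assumed fixed
    reward = [0.0, 1.0]                 # assumed one-step reward per state

    total_reward = 0.0
    total_steps = 0
    state = 0  # state 0 serves as the regeneration state
    for _ in range(num_cycles):
        # One regenerative cycle: leave state 0 and run until the next return.
        while True:
            total_reward += reward[state]
            total_steps += 1
            if state == 0:
                state = 1 if rng.random() < p else 0
            else:
                state = 0 if rng.random() < q else 1
            if state == 0:
                break
    # Ratio estimator: accumulated reward divided by accumulated cycle length.
    return total_reward / total_steps


if __name__ == "__main__":
    for theta in (0.0, 1.0):
        print(theta, simulate_average_reward(theta), exact_average_reward(theta))
```

A gradient-based optimizer of the kind the abstract describes would additionally update the parameter at the end of each cycle; that step is deliberately left out here because the abstract does not specify its form.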
