On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts

A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate its convergence for the case with undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help further settle the open problem. As a side result, we also provide a proof of a version of the supermartingale convergence theorem commonly used in stochastic approximation.
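To make the algorithm concrete, the following is a minimal sketch of MCES on a toy stochastic shortest path problem. The chain environment, the slip probability, the every-visit averaging of returns, and all names (`step`, `greedy`, `mces`) are illustrative assumptions, not taken from the paper. The environment is chosen so that every policy reaches the goal with probability 1, which keeps each simulated episode finite.

```python
import random

# Hypothetical toy problem (an assumption for illustration only):
# non-terminal states 0..3 on a chain, state 4 is the cost-free goal.
N_STATES = 4
ACTIONS = (-1, +1)   # intended move: left or right
SLIP = 0.2           # probability that the move is reversed


def step(state, action, rng):
    """Simulate one transition; every step incurs a unit cost."""
    move = -action if rng.random() < SLIP else action
    next_state = max(0, state + move)   # reflect at the left end
    return next_state, 1.0


def greedy(q, state):
    """Greedy action under the current estimates (costs are minimized)."""
    return min(ACTIONS, key=lambda a: q[(state, a)])


def mces(episodes=5000, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    counts = {key: 0 for key in q}
    for _ in range(episodes):
        # Exploring start: the first state-action pair is drawn uniformly.
        state = rng.randrange(N_STATES)
        action = rng.choice(ACTIONS)
        trajectory = []
        while state != N_STATES:
            next_state, cost = step(state, action, rng)
            trajectory.append((state, action, cost))
            state = next_state
            if state != N_STATES:
                action = greedy(q, state)   # follow the greedy policy
        # Every-visit update: average the undiscounted cost-to-go.
        ret = 0.0
        for s, a, c in reversed(trajectory):
            ret += c
            counts[(s, a)] += 1
            q[(s, a)] += (ret - q[(s, a)]) / counts[(s, a)]
    policy = {s: greedy(q, s) for s in range(N_STATES)}
    return q, policy


if __name__ == "__main__":
    q, policy = mces()
    print(policy)
```

In this toy chain, moving right is the better action in every state, so the learned greedy policy can be checked by inspection. The sketch says nothing about convergence in general, which is the question the paper addresses.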
