Arbitrarily modulated Markov decision processes

We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities vary in an arbitrary (e.g., nonstationary) fashion. We propose an online Q-learning style algorithm and give a guarantee on its performance, evaluated in retrospect against alternative policies. Unlike previous works, the guarantee depends critically on the variability of the transition probabilities, yet it holds under arbitrary changes in the rewards and transition probabilities over time. Besides being computationally efficient, the approach requires neither prior knowledge nor estimation of the transition probabilities.
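For concreteness, the following is a minimal sketch of a generic tabular Q-learning update with epsilon-greedy exploration, the kind of "Q-learning style" online update the abstract alludes to. It is an illustration, not the paper's actual algorithm; all names and parameter values (n_states, n_actions, alpha, gamma, epsilon, and the env_step stub) are hypothetical. Note that the update uses only the observed reward and next state, consistent with requiring neither prior knowledge nor estimation of the transition probabilities.

```python
# Minimal illustrative sketch of a tabular Q-learning update (not the
# paper's algorithm). All parameters below are assumptions.
import random

n_states, n_actions = 5, 3
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # step size, discount, exploration rate

# Q[s][a]: current estimate of the value of taking action a in state s.
Q = [[0.0] * n_actions for _ in range(n_states)]

def choose_action(s):
    # Epsilon-greedy: explore with probability epsilon, else act greedily.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def update(s, a, r, s_next):
    # One online update: only the observed reward r and next state s_next
    # are used; no model of the (possibly changing) transitions is built.
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

def env_step(s, a, t):
    # Hypothetical environment stub; in the paper's setting, rewards and
    # transitions may change arbitrarily from step to step.
    return random.random(), random.randrange(n_states)

s = 0
for t in range(1000):
    a = choose_action(s)
    r, s_next = env_step(s, a, t)
    update(s, a, r, s_next)
    s = s_next
```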
