Regret Minimization in Nonstationary Markov Decision Processes

We consider decision-making problems in Markov decision processes where both the rewards and the transition probabilities may vary in an arbitrary (e.g., nonstationary) fashion. We propose online learning algorithms and provide guarantees on their performance, evaluated in retrospect against stationary policies. Unlike previous works, the guarantees depend critically on the variability of the uncertainty in the transition probabilities, but hold regardless of arbitrary changes in rewards and transition probabilities. First, we use an approach based on robust dynamic programming and extend it to the case where reward observation is limited to the actual state-action trajectory. Next, we present a computationally efficient simulation-based Q-learning-style algorithm that requires neither prior knowledge nor estimation of the transition probabilities. We show both probabilistic performance guarantees and deterministic guarantees on the expected performance.
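As a rough illustration of the simulation-based, Q-learning-style approach mentioned above, the sketch below runs tabular Q-learning on a small MDP whose rewards drift over time. Everything here is a hypothetical setup for illustration only (the state and action counts, the synthetic transition kernel `P`, the drifting reward process, and the step sizes are all assumptions), not the paper's algorithm or its guarantees. The one point it demonstrates is that the learner updates `Q` from the observed state-action-reward trajectory alone, without ever estimating the transition probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1  # step size, discount, exploration

# Unknown to the learner: a fixed transition kernel P[s, a] and a
# reward process that varies arbitrarily with the time step t.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

state = 0
for t in range(10_000):
    # epsilon-greedy action selection from the current Q estimates
    if rng.random() < epsilon:
        action = rng.integers(n_actions)
    else:
        action = int(np.argmax(Q[state]))

    # Simulate one transition; the learner never sees P itself.
    next_state = rng.choice(n_states, p=P[state, action])

    # Nonstationary reward: the learner observes only this scalar.
    reward = rng.normal(loc=np.sin(0.01 * t), scale=0.1)

    # Standard Q-learning update; no transition estimates are kept.
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])
    state = next_state

print(Q)
```

With a fixed step size, the estimates track the drifting rewards rather than converge; analyzing what such tracking yields relative to the best stationary policy is exactly the kind of regret question the paper studies.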
