Adaptive Strategies and Regret Minimization in Arbitrarily Varying Markov Environments

We consider the problem of maximizing the average reward in a controlled Markov environment that also contains some arbitrarily varying elements. This problem is captured by a two-person stochastic game model involving the reward-maximizing agent and a second player, who is free to use an arbitrary (non-stationary and unpredictable) control strategy. While the minimax value of the associated zero-sum game provides a guaranteed performance level, the fact that the second player's behavior is observed as the game unfolds opens up the opportunity to improve upon this minimax value whenever the second player deviates from a worst-case strategy. This basic idea has been formalized in the context of repeated matrix games by the classical notion of regret minimization with respect to the Bayes envelope, where an attainable performance goal is defined in terms of the empirical frequencies of the opponent's actions. This paper extends these ideas to problems with Markovian dynamics, under appropriate recurrence conditions. The Bayes envelope is first defined in a natural way in terms of the observed state-action frequencies. As this envelope may not be attainable in general, we define a proper convexification thereof as an attainable solution concept. In the specific case of single-controller games, where the opponent alone controls the state transitions, the Bayes envelope itself turns out to be convex and attainable. Some concrete examples are shown to fit within this framework.
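
To fix ideas, here is a minimal sketch of the Bayes envelope and the associated no-regret guarantee in the underlying repeated matrix-game setting; the notation (reward function $r$, action sets $A$ and $B$, empirical distribution $\hat q_T$) is illustrative and not necessarily that of the paper. Given a distribution $q$ over the opponent's actions, the Bayes envelope is the best reward attainable in hindsight against $q$:
\[
  r^{*}(q) \;=\; \max_{a \in A} \sum_{b \in B} q(b)\, r(a,b),
  \qquad
  \hat q_T(b) \;=\; \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{b_t = b\},
\]
and a regret-minimizing (Hannan-consistent) strategy for the agent guarantees, against every opponent strategy,
\[
  \liminf_{T \to \infty} \left( \frac{1}{T} \sum_{t=1}^{T} r(a_t, b_t) \;-\; r^{*}(\hat q_T) \right) \;\ge\; 0
  \quad \text{almost surely.}
\]
In the Markovian extension described above, the empirical action frequencies $\hat q_T$ are replaced by the observed state-action frequencies of the stochastic game.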
