The Empirical Bayes Envelope and Regret Minimization in Competitive Markov Decision Processes

This paper proposes an extension of the regret-minimization framework from repeated matrix games to stochastic game models, under appropriate recurrence conditions. A decision maker, P1, who wishes to maximize his long-term average reward, faces a Markovian environment that may also be affected by the arbitrary actions of other agents. The latter are collectively modeled as a second player, P2, whose strategy is arbitrary. Both states and actions are fully observed by both players. While P1 can obviously secure the minimax value of the game, he may wish to improve on it when the opponent does not play a worst-case strategy. For repeated matrix games, an achievable goal is given by the Bayes envelope, which traces P1's best-response payoff against the observed frequencies of P2's actions. We propose a generalization to the stochastic game framework, under recurrence conditions that amount to fixed-state reachability. The empirical Bayes envelope (EBE) is defined as P1's best-response payoff against the stationary strategies of P2 that agree with the observed state-action frequencies. Because the EBE may not be attainable in general, we consider its lower convex hull, the convex Bayes envelope (CBE), which we prove is achievable by P1. The analysis relies on Blackwell's approachability theory. The CBE is bounded below by the value of the game, and for irreducible games it turns out to be strictly above the value whenever P2's frequencies deviate from a worst-case strategy. In the special case of single-controller games, where P2 alone affects the state transitions, the EBE itself is shown to be attainable.
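For concreteness, the Bayes envelope in the baseline repeated-matrix-game setting can be sketched as follows. The notation here ($R$ for P1's payoff matrix, $\Delta(\cdot)$ for the probability simplex, $\hat q_t$ for P2's empirical action frequencies) is assumed for illustration rather than taken from the paper itself:

```latex
% Bayes envelope in a repeated matrix game with P1 payoff matrix R:
% for any mixed action q of P2, BE(q) is P1's best-response payoff.
\[
  BE(q) \;=\; \max_{p \,\in\, \Delta(A_1)} \; p^{\top} R\, q ,
  \qquad q \in \Delta(A_2).
\]
% A regret-minimizing strategy guarantees that P1's long-term average
% reward asymptotically dominates BE(\hat q_t), where \hat q_t is the
% empirical distribution of P2's actions up to time t.
```

In the stochastic-game extension described above, the role of $\hat q_t$ is played by the observed state-action frequencies, and since the resulting empirical Bayes envelope need not be attainable, the achievable target is its lower convex hull, the CBE.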
