Markov Decision Processes with Arbitrary Reward Processes

We consider a control problem in which the decision maker interacts with a standard Markov decision process, except that the reward function varies arbitrarily over time. We extend the notion of Hannan consistency to this setting, showing that, in hindsight, the agent can perform almost as well as the best deterministic policy. We present efficient online algorithms in the spirit of reinforcement learning that guarantee that the agent's performance loss, or regret, vanishes over time, provided that the environment is oblivious to the agent's actions. However, counterexamples show that the regret need not vanish when the environment is not oblivious.
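To make the benchmark concrete, one standard way to formalize this notion of regret (a sketch in notation of our own choosing, assuming a finite MDP with time-varying rewards, not a verbatim definition from the paper) compares the agent's cumulative reward against the best stationary deterministic policy in hindsight:

R_T \;=\; \max_{\pi \in \Pi_{\mathrm{det}}} \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t\bigl(s_t^{\pi}, \pi(s_t^{\pi})\bigr) \right] \;-\; \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t(s_t, a_t) \right],

where r_1, r_2, \ldots are the arbitrarily varying reward functions, (s_t, a_t) is the agent's state-action trajectory, (s_t^{\pi}) is the trajectory induced by following the fixed policy \pi, and \Pi_{\mathrm{det}} denotes the set of stationary deterministic policies. Hannan consistency in this setting then corresponds to R_T / T \to 0 as T \to \infty.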
