Parametric regret in uncertain Markov decision processes

We consider decision making in a Markovian setup where the reward parameters are not known in advance. Our performance criterion is the gap between the performance of the best strategy chosen after the true parameter realization is revealed and the performance of the strategy chosen before the realization is revealed. We call this gap the parametric regret. We consider two related problems: minimax regret and a mean-variance tradeoff of the regret. The minimax regret strategy minimizes the worst-case regret under the most adversarial parameter realization. We show that computing a minimax regret strategy is NP-hard and propose algorithms for finding it efficiently under favorable conditions. The mean-variance tradeoff formulation requires a probabilistic model of the uncertain parameters and seeks a strategy that minimizes a convex combination of the mean and the variance of the regret. We prove that such a strategy can be computed numerically in an efficient way.
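To make the criterion concrete, here is a minimal formalization; the notation ($V_r(\pi)$, $\mathcal{R}$, $\lambda$) is introduced for illustration and is not necessarily the paper's:
\[
\mathrm{Regret}(\pi, r) \;=\; \max_{\pi'} V_r(\pi') \;-\; V_r(\pi),
\]
where $V_r(\pi)$ denotes the expected return of strategy $\pi$ under reward parameter $r$. The minimax regret and mean-variance formulations then solve, respectively,
\[
\min_{\pi} \; \max_{r \in \mathcal{R}} \; \mathrm{Regret}(\pi, r)
\qquad\text{and}\qquad
\min_{\pi} \; (1-\lambda)\,\mathbb{E}\!\left[\mathrm{Regret}(\pi, r)\right] + \lambda \,\mathrm{Var}\!\left[\mathrm{Regret}(\pi, r)\right],
\]
with $\mathcal{R}$ the set of possible parameter realizations, the expectation and variance taken over the assumed probabilistic model of $r$, and $\lambda \in [0,1]$ the tradeoff weight.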
