A Bayesian Approach to Robust Reinforcement Learning

Robust Markov Decision Processes (RMDPs) aim to ensure robustness with respect to changing or adversarial system behavior. In this framework, transitions are modeled as arbitrary elements of a known, suitably structured uncertainty set, and a robust optimal policy is derived under the worst-case scenario. In this study, we address the problem of learning in RMDPs using a Bayesian approach. We introduce the Uncertainty Robust Bellman Equation (URBE), which encourages safe exploration by adapting the uncertainty set to new observations while preserving robustness. We propose a URBE-based algorithm, DQN-URBE, that scales this method to higher-dimensional domains. Our experiments show that the derived URBE-based strategy achieves a better trade-off between conservativeness and robustness in the presence of model misspecification. In addition, we show that the DQN-URBE algorithm adapts significantly faster to changing dynamics online than existing robust techniques with fixed uncertainty sets.
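To illustrate the worst-case framework the abstract builds on, here is a minimal robust value iteration sketch. The 2-state MDP, its rewards, and the finite uncertainty set of candidate transition models are illustrative assumptions, not the paper's setup, and this is plain rectangular robust dynamic programming rather than URBE or DQN-URBE itself.

```python
import numpy as np

# Minimal robust value iteration sketch: for each state-action pair, the
# transition model is an arbitrary element of a known uncertainty set, and
# the Bellman backup takes the worst case over that set.

n_states, n_actions, gamma = 2, 2, 0.9
rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0]])          # rewards[s, a]

# uncertainty_set[s][a]: list of candidate next-state distributions (s,a-rectangular)
uncertainty_set = [
    [[np.array([0.9, 0.1]), np.array([0.7, 0.3])],   # s=0, a=0
     [np.array([0.5, 0.5]), np.array([0.2, 0.8])]],  # s=0, a=1
    [[np.array([0.6, 0.4]), np.array([0.8, 0.2])],   # s=1, a=0
     [np.array([0.1, 0.9]), np.array([0.3, 0.7])]],  # s=1, a=1
]

def robust_value_iteration(tol=1e-8, max_iter=1000):
    v = np.zeros(n_states)
    for _ in range(max_iter):
        q = np.empty((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # adversarial (worst-case) expected next value over the set
                worst = min(p @ v for p in uncertainty_set[s][a])
                q[s, a] = rewards[s, a] + gamma * worst
        v_new = q.max(axis=1)             # greedy maximization over actions
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new, q.argmax(axis=1)

values, policy = robust_value_iteration()
```

A fixed uncertainty set like this is exactly what can make robust solutions overly conservative; the abstract's URBE approach instead adapts the set to new observations via a Bayesian posterior.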
