Twice regularized MDPs and the equivalence between robustness and regularization

Robust Markov decision processes (MDPs) aim to handle changing or partially known system dynamics. To solve them, one typically resorts to robust optimization methods. However, this significantly increases computational complexity and limits scalability in both learning and planning. On the other hand, regularized MDPs show more stability in policy learning without increasing time complexity, yet they generally do not account for uncertainty in the model dynamics. In this work, we aim to learn robust MDPs using regularization. We first show that regularized MDPs are a particular instance of robust MDPs with uncertain reward. We thus establish that policy iteration on reward-robust MDPs can have the same time complexity as on regularized MDPs. We further extend this relationship to MDPs with uncertain transitions: this leads to a regularization term with an additional dependence on the value function. We finally generalize regularized MDPs to twice regularized MDPs (R2 MDPs), i.e., MDPs with both value and policy regularization. The corresponding Bellman operators enable us to develop policy iteration schemes with convergence and robustness guarantees. This framework also reduces planning and learning in robust MDPs to their regularized counterparts.
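To make the reward-robust case concrete, here is a minimal worked sketch of the equivalence the abstract alludes to; the notation (nominal reward r_0, nominal kernel P_0, radii alpha_s, Hölder conjugate q*) is ours and need not match the paper's exact statement. For an s-rectangular ball of reward uncertainty $\mathcal{R}_s = \{ r_{0,s} + r'_s : \|r'_s\|_q \le \alpha_s \}$ with fixed transitions, the robust policy evaluation operator satisfies

$$ [T^{\pi}_{\mathcal{R}} v](s) \;=\; \min_{\|r'_s\|_q \le \alpha_s} \big\langle \pi_s,\; r_{0,s} + r'_s + \gamma P_{0,s} v \big\rangle \;=\; \big\langle \pi_s,\; r_{0,s} + \gamma P_{0,s} v \big\rangle \;-\; \alpha_s \|\pi_s\|_{q^*}, $$

which is exactly the evaluation operator of a regularized MDP with policy regularizer $\Omega(\pi_s) = \alpha_s \|\pi_s\|_{q^*}$, the support function of the uncertainty ball. Under an analogous ball on the transitions, the same argument suggests a regularizer of the form $(\alpha_s + \gamma \beta_s \|v\|)\,\|\pi_s\|_{q^*}$, whose dependence on $\|v\|$ is the "value regularization" behind R2 MDPs; this second expression is an illustrative guess at the structure rather than the paper's precise result.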
