Soft-Robust Actor-Critic Policy-Gradient

Robust Reinforcement Learning aims to derive optimal behavior that accounts for model uncertainty in dynamical systems. However, previous studies have shown that, by planning for the worst-case scenario, robust policies can be overly conservative. Our soft-robust framework is an attempt to overcome this issue. In this paper, we present a novel Soft-Robust Actor-Critic algorithm (SR-AC) that learns an optimal policy with respect to a distribution over an uncertainty set, thereby remaining robust to model uncertainty while avoiding the conservativeness of worst-case robust strategies. We show the convergence of SR-AC and evaluate the efficiency of our approach on several domains by comparing it against non-robust learning methods and their robust formulations.
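To make the soft-robust idea concrete, the sketch below shows a minimal tabular actor-critic whose updates average over a distribution of transition models rather than optimizing against the worst one. This is an illustrative assumption-laden sketch, not the paper's SR-AC algorithm: all names (`UNCERTAINTY_SET`, `model_weights`, the shared reward table) are hypothetical, the uncertainty set is a small finite collection of random kernels, and the soft-robust expectation is approximated by sampling a model per episode.

```python
# Minimal sketch of a soft-robust actor-critic update on a small tabular MDP.
# Assumptions: finite uncertainty set, a fixed weighting over models, rewards
# shared across models, and model-sampling as a stand-in for the soft-robust
# (weighted-average) objective. Not the paper's exact SR-AC procedure.

import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 4, 2
GAMMA = 0.95
ALPHA_CRITIC, ALPHA_ACTOR = 0.1, 0.01   # two-timescale step sizes (actor slower)

def random_kernel():
    """Random transition kernel P[s, a, s'] normalized over next states."""
    P = rng.random((N_STATES, N_ACTIONS, N_STATES))
    return P / P.sum(axis=-1, keepdims=True)

UNCERTAINTY_SET = [random_kernel() for _ in range(3)]   # candidate models
model_weights = np.array([0.5, 0.3, 0.2])               # distribution over the set
R = rng.random((N_STATES, N_ACTIONS))                   # shared reward table

theta = np.zeros((N_STATES, N_ACTIONS))                 # softmax policy parameters
V = np.zeros(N_STATES)                                  # tabular critic

def policy(s):
    prefs = np.exp(theta[s] - theta[s].max())
    return prefs / prefs.sum()

for episode in range(500):
    # Soft-robust step: draw a model according to the weighting over the
    # uncertainty set, so learning targets the weighted-average performance
    # instead of the adversarial worst case.
    P = UNCERTAINTY_SET[rng.choice(len(UNCERTAINTY_SET), p=model_weights)]
    s = rng.integers(N_STATES)
    for t in range(50):
        probs = policy(s)
        a = rng.choice(N_ACTIONS, p=probs)
        s_next = rng.choice(N_STATES, p=P[s, a])
        r = R[s, a]

        # Critic: TD(0) update toward the (sampled) soft-robust value.
        td_error = r + GAMMA * V[s_next] - V[s]
        V[s] += ALPHA_CRITIC * td_error

        # Actor: policy-gradient step, TD error used as the advantage,
        # grad of log-softmax at (s, a) is e_a - probs.
        grad_log = -probs
        grad_log[a] += 1.0
        theta[s] += ALPHA_ACTOR * td_error * grad_log

        s = s_next
```

In expectation over episodes, sampling models with probabilities `model_weights` makes the updates follow the value averaged over the uncertainty set, which is the soft-robust criterion the abstract describes; a worst-case robust variant would instead update against the minimizing model at each step.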
