Stabilizing Off-Policy Reinforcement Learning with Conservative Policy Gradients

In recent years, advances in deep learning have enabled the application of reinforcement learning algorithms in complex domains. However, they lack the theoretical guarantees which are present in the tabular setting and suffer from many stability and reproducibility problems \citep{henderson2018deep}. In this work, we suggest a simple approach for improving stability and providing probabilistic performance guarantees in off-policy actor-critic deep reinforcement learning regimes. Experiments on continuous action spaces, in the MuJoCo control suite, show that our proposed method reduces the variance of the process and improves the overall performance.

[1]  Emma Brunskill,et al.  Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds , 2019, ICML.

[2]  Martin L. Puterman,et al.  Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .

[3]  Lihong Li,et al.  Policy Certificates: Towards Accountable Reinforcement Learning , 2018, ICML.

[4]  Yuval Tassa,et al.  Continuous control with deep reinforcement learning , 2015, ICLR.

[5]  Tom Schaul,et al.  Prioritized Experience Replay , 2015, ICLR.

[6]  John Langford,et al.  Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.

[7]  A. PrashanthL. Policy Gradients for CVaR-Constrained MDPs , 2014, ALT.

[8]  Xi Chen,et al.  Evolution Strategies as a Scalable Alternative to Reinforcement Learning , 2017, ArXiv.

[9]  Peter Auer,et al.  Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..

[10]  Philip S. Thomas,et al.  High Confidence Policy Improvement , 2015, ICML.

[11]  Shie Mannor,et al.  Distributional Policy Optimization: An Alternative Approach for Continuous Control , 2019, NeurIPS.

[12]  Massimiliano Pontil,et al.  Empirical Bernstein Bounds and Sample-Variance Penalization , 2009, COLT.

[13]  Sergey Levine,et al.  Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor , 2018, ICML.

[14]  Marek Petrik,et al.  Safe Policy Improvement by Minimizing Robust Baseline Regret , 2016, NIPS.

[15]  Alessandro Lazaric,et al.  Multi-Bandit Best Arm Identification , 2011, NIPS.

[16]  Sergey Levine,et al.  Trust Region Policy Optimization , 2015, ICML.

[17]  Kenneth O. Stanley,et al.  On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent , 2017, ArXiv.

[18]  Kenneth O. Stanley,et al.  Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents , 2017, NeurIPS.

[19]  R. Bellman,et al.  Dynamic Programming and Markov Processes , 1960 .

[20]  Andrew W. Moore,et al.  Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time , 1993, Machine Learning.

[21]  Alec Radford,et al.  Proximal Policy Optimization Algorithms , 2017, ArXiv.

[22]  Michael I. Jordan,et al.  Is Q-learning Provably Efficient? , 2018, NeurIPS.

[23]  Kenneth O. Stanley,et al.  Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning , 2017, ArXiv.

[24]  Alex Graves,et al.  Asynchronous Methods for Deep Reinforcement Learning , 2016, ICML.

[25]  M. T. Wasan Stochastic Approximation , 1969 .

[26]  Bruno Scherrer,et al.  Approximate Policy Iteration Schemes: A Comparison , 2014, ICML.

[27]  B. L. Welch The generalisation of student's problems when several different population variances are involved. , 1947, Biometrika.

[28]  Shie Mannor,et al.  Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies , 2019, NeurIPS.

[29]  L. A. Prashanth Policy Gradients for CVaR-Constrained MDPs , 2014, ALT 2014.

[30]  Herke van Hoof,et al.  Addressing Function Approximation Error in Actor-Critic Methods , 2018, ICML.

[31]  Tom Schaul,et al.  Rainbow: Combining Improvements in Deep Reinforcement Learning , 2017, AAAI.

[32]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[33]  Shane Legg,et al.  Human-level control through deep reinforcement learning , 2015, Nature.

[34]  Philip Bachman,et al.  Deep Reinforcement Learning that Matters , 2017, AAAI.

[35]  Shie Mannor,et al.  A Nonparametric Sequential Test for Online Randomized Experiments , 2016, WWW.

[36]  David Silver,et al.  Deep Reinforcement Learning with Double Q-Learning , 2015, AAAI.