Addressing Function Approximation Error in Actor-Critic Methods

In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested.
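
A minimal sketch of the two mechanisms the abstract describes, written as assumed PyTorch-style code rather than the authors' released implementation (the names actor_target, critic1_target, critic2_target, and policy_delay are hypothetical placeholders): the bootstrapped critic target takes the minimum over a pair of target critics to limit overestimation, and the actor is refreshed only once every few critic updates.

    # Sketch only, under assumed network interfaces; not the paper's official code.
    import torch

    def clipped_double_q_target(reward, not_done, next_state,
                                actor_target, critic1_target, critic2_target,
                                gamma=0.99):
        # Bootstrap from the minimum of two target critics to limit overestimation.
        with torch.no_grad():
            next_action = actor_target(next_state)
            q1 = critic1_target(next_state, next_action)
            q2 = critic2_target(next_state, next_action)
            return reward + not_done * gamma * torch.min(q1, q2)

    # Delayed policy updates: update the actor and the target networks only every
    # `policy_delay` critic updates, reducing per-update error in the policy.
    policy_delay = 2
    # for it in range(num_iterations):
    #     ...update both critics toward clipped_double_q_target(...)...
    #     if it % policy_delay == 0:
    #         ...update actor; soft-update all target networks...

Taking the pairwise minimum biases the target toward underestimation, which the abstract argues is preferable to the overestimation introduced by the max-style bootstrapping of standard Q-learning.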
