PGQ: Combining policy gradient and Q-learning

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and cannot take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. The method is motivated by a connection between the fixed points of the regularized policy gradient algorithm and the Q-values. This connection allows us to estimate the Q-values from the action preferences of the policy, to which we apply Q-learning updates. We refer to the new technique as 'PGQL', for policy gradient and Q-learning. We also establish an equivalence between action-value fitting techniques and actor-critic algorithms, showing that regularized policy gradient techniques can be interpreted as advantage function learning algorithms. We conclude with numerical examples that demonstrate the improved data efficiency and stability of PGQL. In particular, we tested PGQL on the full suite of Atari games and achieved performance exceeding that of both asynchronous advantage actor-critic (A3C) and Q-learning.
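To make the idea concrete, below is a minimal sketch of a PGQL-style update: Q-value estimates are recovered from the policy's action preferences and state-value baseline, and a Q-learning (Bellman) loss on replayed transitions is mixed with a regularized policy gradient loss. This is an illustrative toy example, not the authors' Atari implementation; the chain MDP, the entropy weight `alpha`, the mixing weight `eta`, and the use of PyTorch with tabular parameters are all assumptions made for the sketch.

```python
# Hedged sketch of a PGQL-style update on a toy chain MDP (tabular, PyTorch).
import random
import torch
import torch.nn.functional as F

n_states, n_actions = 5, 2
gamma, alpha, eta = 0.9, 0.1, 0.5   # discount, entropy weight, PG/Q mixing weight (assumed values)

# Learnable action preferences (policy logits) and a state-value baseline.
logits = torch.zeros(n_states, n_actions, requires_grad=True)
values = torch.zeros(n_states, requires_grad=True)
opt = torch.optim.SGD([logits, values], lr=0.05)

def step(s, a):
    """Toy chain: action 1 moves right (reward 1 at the end), action 0 resets."""
    if a == 1:
        s2 = min(s + 1, n_states - 1)
        return s2, (1.0 if s2 == n_states - 1 else 0.0)
    return 0, 0.0

def q_tilde(s_batch):
    """Recover Q estimates from the regularized policy's action preferences:
    Q~(s, a) = alpha * (log pi(a|s) + H(pi(.|s))) + V(s)."""
    logp = F.log_softmax(logits[s_batch], dim=-1)
    entropy = -(logp.exp() * logp).sum(-1, keepdim=True)
    return alpha * (logp + entropy) + values[s_batch].unsqueeze(-1)

replay, s = [], 0
for t in range(2000):
    # --- on-policy regularized policy gradient (actor-critic) term ---
    pi = F.softmax(logits[s], dim=-1)
    a = torch.multinomial(pi.detach(), 1).item()
    s2, r = step(s, a)
    replay.append((s, a, r, s2))

    with torch.no_grad():
        adv = r + gamma * values[s2] - values[s]          # one-step advantage
    logp = F.log_softmax(logits[s], dim=-1)
    entropy = -(logp.exp() * logp).sum()
    pg_loss = -(adv * logp[a] + alpha * entropy)
    v_loss = 0.5 * (r + gamma * values[s2].detach() - values[s]) ** 2

    # --- off-policy Q-learning term on Q~, using the replay buffer ---
    bs, ba, br, bs2 = zip(*random.sample(replay, min(8, len(replay))))
    bs, ba, bs2 = torch.tensor(bs), torch.tensor(ba), torch.tensor(bs2)
    br = torch.tensor(br)
    q_sa = q_tilde(bs).gather(1, ba.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = br + gamma * q_tilde(bs2).max(dim=1).values
    q_loss = 0.5 * ((target - q_sa) ** 2).mean()

    # Combined PGQL update: mix the two terms and step both parameter sets.
    loss = (1 - eta) * (pg_loss + v_loss) + eta * q_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    s = s2
```

The key design point mirrored from the abstract is that the Q-learning loss is not applied to a separate Q-network: `q_tilde` is derived entirely from the policy's action preferences and the value baseline, so the off-policy Bellman updates and the on-policy gradient updates shape the same parameters.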
