The Deep Quality-Value Family of Deep Reinforcement Learning Algorithms

We present a novel approach for learning an approximation of the optimal state-action value function (Q) in model-free Deep Reinforcement Learning (DRL). We propose to learn this approximation while simultaneously learning an approximation of the state-value function (V). We introduce two new DRL algorithms, DQV-Learning and DQV-Max Learning, which follow this learning dynamic: both use two separate neural networks to learn the V function and the Q function. We validate the effectiveness of this training scheme by thoroughly comparing our algorithms to DRL methods that learn an approximation of the Q function alone, namely DQN and DDQN. Our results show that DQV and DQV-Max offer several important benefits: they converge significantly faster, achieve super-human performance on DRL testbeds on which DQN and DDQN fail to do so, and suffer less from the overestimation bias of the Q function.
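The core idea above can be illustrated with a minimal tabular sketch. This is an assumption-laden reading of the abstract, not the paper's verbatim equations: two separate estimators (here plain dictionaries standing in for the two neural networks) are updated with temporal-difference targets, where DQV bootstraps both estimators from V, and DQV-Max is assumed to bootstrap V from the greedy Q value instead. The names `dqv_update` and `dqv_max_update` are hypothetical.

```python
from collections import defaultdict

GAMMA = 0.99   # discount factor
ALPHA = 0.1    # learning rate

V = defaultdict(float)   # state-value estimates (one "network")
Q = defaultdict(float)   # state-action value estimates, keyed by (s, a)

def dqv_update(s, a, r, s_next, done):
    """One DQV-style TD step (our sketch): both V and Q regress toward a
    shared target built from the V estimate of the next state."""
    target = r if done else r + GAMMA * V[s_next]
    V[s] += ALPHA * (target - V[s])
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def dqv_max_update(s, a, r, s_next, done, actions):
    """Assumed DQV-Max variant: V bootstraps from max_a Q of the next
    state, while Q still bootstraps from V."""
    q_max = max(Q[(s_next, b)] for b in actions)
    v_target = r if done else r + GAMMA * q_max
    q_target = r if done else r + GAMMA * V[s_next]
    V[s] += ALPHA * (v_target - V[s])
    Q[(s, a)] += ALPHA * (q_target - Q[(s, a)])
```

For example, after one terminal transition with reward 1.0, `dqv_update("s0", 0, 1.0, "s1", True)` moves both `V["s0"]` and `Q[("s0", 0)]` a step of size `ALPHA` toward 1.0. In the deep setting, each dictionary would be replaced by a neural network trained on the corresponding squared TD error.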
