Approximating two value functions instead of one: towards characterizing a new family of Deep Reinforcement Learning algorithms

This paper takes a step towards characterizing a new family of \textit{model-free} Deep Reinforcement Learning (DRL) algorithms. The aim of these algorithms is to jointly learn an approximation of the state-value function ($V$) alongside an approximation of the state-action value function ($Q$). Our analysis starts with a thorough study of Deep Quality-Value (DQV) Learning, a DRL algorithm which has been shown to outperform popular techniques such as Deep Q-Learning (DQN) and Double Deep Q-Learning (DDQN) \cite{sabatelli2018deep}. To investigate why DQV's learning dynamics allow it to perform so well, we formulate a set of research questions that help us characterize this new family of DRL algorithms. Among our results, we identify specific cases in which DQV's performance degrades and introduce a novel \textit{off-policy} DRL algorithm, called DQV-Max, which can outperform DQV. We then study the behavior of the $V$ and $Q$ functions learned by DQV and DQV-Max and show that both algorithms might perform so well on several DRL test-beds because they are less prone to the overestimation bias of the $Q$ function.
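To make the idea of jointly learning the two approximators concrete, the temporal-difference targets below sketch how $V$ and $Q$ can bootstrap from one another. The notation is an illustrative reconstruction based on the DQV formulation in \cite{sabatelli2018deep} and on the description of DQV-Max as \textit{off-policy}; in particular, the parameters $\theta$ (for $Q$), $\Phi$ (for $V$) and the placement of the target-network copies $\theta^{-}$, $\Phi^{-}$ are assumptions, not a verbatim restatement of either algorithm. In DQV, both networks regress towards the same target bootstrapped from the state-value estimate:
\begin{align}
    y_t^{V} &= r_t + \gamma\, V(s_{t+1}; \Phi^{-}), &
    y_t^{Q} &= r_t + \gamma\, V(s_{t+1}; \Phi^{-}),
\end{align}
whereas DQV-Max, under this reading, bootstraps the $V$ estimate from the maximal $Q$-value while keeping a $V$-based target for $Q$:
\begin{align}
    y_t^{V} &= r_t + \gamma\, \max_{a} Q(s_{t+1}, a; \theta^{-}), &
    y_t^{Q} &= r_t + \gamma\, V(s_{t+1}; \Phi).
\end{align}
If the targets take this form, the $Q$ update in both algorithms avoids maximizing over the agent's own action-value estimates, which is the usual source of the overestimation bias discussed above.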

[1] Tom Schaul et al. Dueling Network Architectures for Deep Reinforcement Learning. ICML, 2015.

[2] Pieter Abbeel et al. Towards Characterizing Divergence in Deep Q-Learning. ArXiv, 2019.

[3] Guigang Zhang et al. Deep Learning. Int. J. Semantic Comput., 2016.

[4] Shane Legg et al. Noisy Networks for Exploration. ICLR, 2017.

[5] Richard S. Sutton et al. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[6] Marc G. Bellemare et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract). IJCAI, 2012.

[7] Mahesan Niranjan et al. On-line Q-learning using connectionist systems. 1994.

[8] Rich Caruana et al. Multitask Learning. Machine-mediated learning, 1997.

[9] R. Bellman. Dynamic programming. Science, 1957.

[10] Sergey Levine et al. Temporal Difference Models: Model-Free Deep RL for Model-Based Control. ICLR, 2018.

[11] Long-Ji Lin et al. Reinforcement learning for robots using neural networks. 1992.

[12] Herke van Hoof et al. Addressing Function Approximation Error in Actor-Critic Methods. ICML, 2018.

[13] Marco Wiering. QV(λ)-learning: A New On-policy Reinforcement Learning Algorithm. 2005.

[14] Andrew W. Moore et al. Generalization in Reinforcement Learning: Safely Approximating the Value Function. NIPS, 1994.

[15] Geoffrey E. Hinton et al. Deep Learning. Nature, 2015.

[16] Tom Schaul et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. AAAI, 2017.

[17] Matteo Hessel et al. Deep Reinforcement Learning and the Deadly Triad. ArXiv, 2018.

[18] Richard S. Sutton et al. Reinforcement Learning: An Introduction. IEEE Trans. Neural Networks, 1998.

[19] Shane Legg et al. Human-level control through deep reinforcement learning. Nature, 2015.

[20] Philip Bachman et al. Deep Reinforcement Learning that Matters. AAAI, 2017.

[21] Jürgen Schmidhuber et al. Deep learning in neural networks: An overview. Neural Networks, 2014.

[22] Gilles Louppe et al. Deep Quality-Value (DQV) Learning. BNAIC/BENELEARN, 2019.

[23] David Silver et al. Deep Reinforcement Learning with Double Q-Learning. AAAI, 2015.