A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning

Abstract Deep reinforcement learning (DRL) algorithms with experience replay have been used to solve many sequential decision-making problems. In practice, however, DRL algorithms still suffer from data inefficiency, which limits their applicability to real-world problems. To improve the data efficiency of DRL, this paper proposes a new multi-step method. Unlike traditional algorithms, the proposed method uses a new return function that alters the discounting of future rewards and reduces the weight of the immediate reward when selecting the action for the current state, making more efficient use of the reward data. Combining the proposed method with the classic DRL algorithms deep Q-networks (DQN) and double deep Q-networks (DDQN) yields two novel algorithms, expected n-step DQN (EnDQN) and expected n-step DDQN (EnDDQN), which learn more efficiently from experience replay. Their performance is validated in two simulation environments, CartPole and DeepTraffic. The experimental results demonstrate that the proposed multi-step method greatly improves the data efficiency of DRL agents and further improves the performance of the classic DRL algorithms into whose training it is incorporated.
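For context, the following is a minimal sketch of the conventional n-step Q-learning target that multi-step methods of this kind build on. The abstract does not specify the exact re-weighted return used by EnDQN/EnDDQN, so only the standard form is shown; the function name n_step_target and its parameters (gamma, n, q_next) are illustrative, not taken from the paper.

```python
def n_step_target(rewards, q_next, gamma=0.99, n=3):
    # Standard n-step Q-learning target:
    #   G_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * max_a Q(s_{t+n}, a)
    # The paper's return function re-weights these terms (reducing the
    # immediate reward's influence); that exact weighting is not given here.
    assert len(rewards) == n, "rewards must hold r_t .. r_{t+n-1}"
    discounted_sum = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return discounted_sum + (gamma ** n) * q_next

# Example: a 3-step target with a bootstrapped value of 5.0 at state s_{t+3}.
print(n_step_target([1.0, 0.0, 1.0], q_next=5.0))
```

Larger n propagates reward information further back per update, which is the usual mechanism by which multi-step targets improve data efficiency over one-step DQN.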
