Increasing the Action Gap: New Operators for Reinforcement Learning

This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
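To make the tabular construction concrete, here is a minimal NumPy sketch of a locally consistent, gap-increasing backup on a finite MDP. The abstract does not spell out the operator's formula, so the correction term used below, gamma * P(x|x,a) * (max_b Q(x,b) - Q(x,a)) applied at self-transitions, is drawn from the standard statement of the consistent Bellman operator and should be read as an illustrative assumption rather than the paper's exact algorithm.

```python
import numpy as np

def consistent_bellman_backup(Q, R, P, gamma):
    """One full sweep of a gap-increasing Bellman backup on a tabular MDP.

    Illustrative sketch only: the self-transition correction below follows
    the usual statement of the consistent Bellman operator, which the
    abstract alludes to but does not define.

    Q:     (S, A) array of current action values
    R:     (S, A) array of expected immediate rewards
    P:     (S, A, S) array of transition probabilities P(x' | x, a)
    gamma: discount factor in [0, 1)
    """
    V = Q.max(axis=1)  # greedy state values V(x) = max_b Q(x, b)
    S, A = Q.shape
    Q_new = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            # Standard Bellman backup: r(x,a) + gamma * E[V(x')].
            backup = R[s, a] + gamma * P[s, a] @ V
            # Local-consistency correction at self-transitions x' = x:
            # subtracting gamma * P(x|x,a) * (V(x) - Q(x,a)) leaves the
            # greedy-optimal policy unchanged while widening the action
            # gap V(x) - Q(x,a) for suboptimal actions a.
            Q_new[s, a] = backup - gamma * P[s, a, s] * (V[s] - Q[s, a])
    return Q_new
```

Iterating this sweep to its fixed point yields, under the conditions the paper establishes, a Q-function whose greedy policy matches the optimal one but whose gap between the best and second-best action at each state is at least as large as under the standard Bellman operator; this enlarged gap is what the abstract argues makes the induced greedy policy more robust to approximation and estimation error. Baird's advantage learning, also discussed in the paper, applies a correction of the same shape at every transition rather than only at self-transitions.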
