Exploration Conscious Reinforcement Learning Revisited

The exploration-exploitation tradeoff arises in Reinforcement Learning when one cannot tell whether a policy is optimal; there is then a constant need to explore new actions rather than exploit past experience. In practice, the tradeoff is commonly resolved by using a fixed exploration mechanism, such as $\epsilon$-greedy exploration or additive Gaussian noise, while still trying to learn an optimal policy. In this work, we take a different approach and study exploration-conscious criteria that yield policies that are optimal with respect to the exploration mechanism. Solving these criteria, as we establish, amounts to solving a surrogate Markov Decision Process. We then analyze properties of exploration-conscious optimal policies and characterize two general approaches for solving such criteria. Building on these approaches, we apply simple changes to existing tabular and deep Reinforcement Learning algorithms and empirically demonstrate superior performance relative to their non-exploration-conscious counterparts, on both discrete and continuous action spaces.
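
To illustrate how an exploration-conscious criterion can be plugged into an existing algorithm with a small change, the sketch below shows a tabular Q-learning update whose bootstrap target is taken in expectation over a fixed $\epsilon$-greedy mechanism (an Expected-Sarsa-style backup) rather than over the greedy action alone. This is a minimal sketch under the assumption of a fixed $\epsilon$-greedy scheme; the function names and hyperparameters are illustrative and not the paper's exact algorithm.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's exact algorithm): a tabular
# Q-learning update that is exploration-conscious with respect to a fixed
# epsilon-greedy mechanism. The bootstrap target uses the expected value of
# Q(s', .) under the epsilon-greedy policy, so the learned Q-function is
# optimal *given* that the agent keeps exploring with probability epsilon.

def epsilon_greedy_value(q_row, eps):
    """Expected value of a state's Q-values under an epsilon-greedy policy."""
    return (1.0 - eps) * q_row.max() + eps * q_row.mean()

def exploration_conscious_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, eps=0.1):
    """One-step TD update toward the exploration-conscious target."""
    target = r + gamma * epsilon_greedy_value(Q[s_next], eps)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy usage on a small table: 5 states, 3 actions.
if __name__ == "__main__":
    Q = np.zeros((5, 3))
    Q = exploration_conscious_update(Q, s=0, a=1, r=1.0, s_next=2)
    print(Q[0])
```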
