Success Probability of Exploration: a Concrete Analysis of Learning Efficiency

Exploration has long been a crucial part of reinforcement learning, yet several important questions concerning exploration efficiency are still not answered satisfactorily by existing analytical frameworks. These questions include how to set exploration parameters, how to analyse specific situations, and how to assess the hardness of MDPs, all of which are unavoidable for practitioners. To bridge the gap between theory and practice, we propose a new analytical framework called the success probability of exploration. We show that all of these important questions about exploration can be answered under our framework, and that the answers it provides meet practitioners' needs better than existing ones. More importantly, we introduce a concrete and practical approach to evaluating success probabilities in certain MDPs without actually running the learning algorithm. We then provide empirical results to verify our approach, and demonstrate how the success probability of exploration can be used to analyse and predict the behaviours and possible outcomes of exploration, which are key to answering the important questions above.
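To make the notion concrete, the sketch below illustrates what a "success probability of exploration" means in the simplest empirical sense: the fraction of independent learning runs in which exploration succeeds under a fixed budget. This is a minimal illustrative example only, not the paper's analytical approach (which evaluates success probabilities without running the learner); the toy chain MDP, the epsilon-greedy Q-learning agent, the success criterion, and all parameter values are assumptions chosen for illustration.

```python
# Minimal sketch (NOT the paper's method): Monte Carlo estimate of a
# "success probability of exploration" on a toy chain MDP, where success
# is defined as the agent visiting the distant goal state at least once
# within a fixed interaction budget. All names and parameters are
# illustrative assumptions.
import random

N_STATES = 6          # chain states 0..5; state 5 is the distant, high-reward goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
EPISODE_LEN = 20
BUDGET_EPISODES = 50  # exploration budget per learning run


def step(state, action):
    """Chain dynamics: small reward for staying left, large reward at the right end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    reward = 10.0 if nxt == N_STATES - 1 else (0.1 if nxt == 0 else 0.0)
    return nxt, reward


def run_once(epsilon, seed):
    """One epsilon-greedy Q-learning run; success = the goal state is visited at least once."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    reached_goal = False
    for _ in range(BUDGET_EPISODES):
        s = 0
        for _ in range(EPISODE_LEN):
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda x: q[s][x])
            s2, r = step(s, a)
            q[s][a] += 0.5 * (r + 0.95 * max(q[s2]) - q[s][a])
            reached_goal |= (s2 == N_STATES - 1)
            s = s2
    return reached_goal


def success_probability(epsilon, n_runs=500):
    """Fraction of independent runs whose exploration succeeds under the budget."""
    return sum(run_once(epsilon, seed) for seed in range(n_runs)) / n_runs


if __name__ == "__main__":
    for eps in (0.05, 0.2, 0.5):
        print(f"epsilon={eps:.2f}  estimated success probability={success_probability(eps):.2f}")
```

Sweeping the exploration parameter this way shows how a success probability can, in principle, inform parameter setting and expose the hardness of an MDP; the framework proposed in the paper obtains such probabilities analytically rather than by repeated simulation.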
