Action-Gap Phenomenon in Reinforcement Learning

Practitioners of reinforcement learning have often observed that an agent's performance comes very close to optimal even while its estimated (action-)value function is still far from the optimal one. The goal of this paper is to explain and formalize this phenomenon by introducing the concept of action-gap regularity. As a typical result, we prove that for an agent following the greedy policy \(\hat{\pi}\) with respect to an action-value function \(\hat{Q}\), the performance loss \(E[V^*(X) - V^{\hat{\pi}}(X)]\) is upper bounded by \(O(\|\hat{Q} - Q^*\|_\infty^{1+\zeta})\), in which \(\zeta \ge 0\) is the parameter quantifying the action-gap regularity. For \(\zeta > 0\), our results indicate a smaller performance loss than previous analyses suggested. Finally, we show how this regularity affects the performance of the family of approximate value iteration algorithms.
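
To make the claimed mechanism concrete, here is a short worked sketch of the single-state argument (our illustration of the idea, not the paper's formal development; the optimal action \(a^*(x)\), the gap \(g(x)\), and \(\varepsilon\) are notation introduced here for exposition). Define the action gap \(g(x) = Q^*(x, a^*(x)) - \max_{a \neq a^*(x)} Q^*(x, a)\) and let \(\varepsilon = \|\hat{Q} - Q^*\|_\infty\). If \(g(x) > 2\varepsilon\), then
\[
\hat{Q}(x, a^*(x)) \;\ge\; Q^*(x, a^*(x)) - \varepsilon \;>\; \max_{a \neq a^*(x)} Q^*(x, a) + \varepsilon \;\ge\; \max_{a \neq a^*(x)} \hat{Q}(x, a),
\]
so the greedy policy \(\hat{\pi}\) selects the optimal action at \(x\) and incurs no per-state regret \(Q^*(x, \pi^*(x)) - Q^*(x, \hat{\pi}(x))\) there. Roughly speaking, the performance loss therefore accumulates only over states whose gap is at most \(2\varepsilon\); the action-gap regularity parameter \(\zeta\) controls how rarely such small-gap states occur, which is what yields the improved exponent \(1+\zeta\) over classical bounds that scale linearly with \(\|\hat{Q} - Q^*\|_\infty\).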
