Log-normality and Skewness of Estimated State/Action Values in Reinforcement Learning

Under- and overestimation of state/action values are harmful to reinforcement learning agents. In this paper, we show that a state/action value estimated using the Bellman equation can be decomposed into a weighted sum of path-wise values that follow log-normal distributions. Since log-normal distributions are skewed, the distribution of estimated state/action values can also be skewed, leading to an imbalanced likelihood of under- and overestimation. The degree of this imbalance can vary greatly among actions and policies within a single problem instance, making the agent prone to selecting actions/policies that have inferior expected returns but a higher likelihood of overestimation. We present a comprehensive analysis of this skewness, examine its contributing factors and impacts through both theoretical and empirical results, and discuss possible ways to reduce its undesirable effects.
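To make the skewness claim concrete, the following minimal sketch (not the paper's method; all path weights and log-normal parameters are hypothetical) simulates a value estimate modeled as a weighted sum of log-normally distributed path-wise values, and reports its skewness and the resulting imbalance between under- and overestimation of the expected value.

```python
import numpy as np

# Illustrative sketch under assumed parameters: a value estimate V_hat is modeled
# as a weighted sum of path-wise terms, each drawn from a log-normal distribution.
# Log-normal variables are right-skewed, so V_hat is right-skewed as well, and the
# chances of falling above vs. below the expected value are imbalanced.

rng = np.random.default_rng(0)

n_paths = 5                                   # hypothetical number of contributing paths
weights = rng.dirichlet(np.ones(n_paths))     # hypothetical path weights (sum to 1)
mu = rng.normal(0.0, 0.5, n_paths)            # hypothetical log-scale means
sigma = rng.uniform(0.3, 1.0, n_paths)        # hypothetical log-scale std devs

# Monte Carlo samples of the estimate: weighted sum of log-normal path values.
samples = (weights * rng.lognormal(mu, sigma, size=(100_000, n_paths))).sum(axis=1)

# Exact mean of the weighted sum, using E[LogNormal(mu, sigma)] = exp(mu + sigma^2 / 2).
expected_value = (weights * np.exp(mu + 0.5 * sigma**2)).sum()

skewness = ((samples - samples.mean())**3).mean() / samples.std()**3
print(f"expected value        : {expected_value:.3f}")
print(f"sample median         : {np.median(samples):.3f}")   # below the mean for right skew
print(f"sample skewness       : {skewness:.3f}")              # positive (right-skewed)
print(f"P(estimate > mean)    : {(samples > expected_value).mean():.3f}")  # below 0.5
```

Under these assumed parameters, the estimate exceeds its expectation less than half the time, but the exceedances are larger in magnitude, which is the kind of imbalance the abstract refers to.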
