Distributional Reinforcement Learning with Quantile Regression

In reinforcement learning, an agent interacts with the environment by taking actions and observing the next state and reward. When sampled probabilistically, these state transitions, rewards, and actions can all induce randomness in the observed long-term return. Traditionally, reinforcement learning algorithms average over this randomness to estimate the value function. In this paper, we build on recent work advocating a distributional approach to reinforcement learning, in which the distribution over returns is modeled explicitly rather than summarized by its mean. That is, we examine methods of learning the value distribution instead of the value function. We give results that close a number of gaps between the theoretical and algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we extend existing results to the approximate distribution setting. Second, we present a novel distributional reinforcement learning algorithm consistent with our theoretical formulation. Finally, we evaluate this new algorithm on the Atari 2600 games, observing that it significantly outperforms many recent improvements to DQN, including the related distributional algorithm C51.
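The quantity at the heart of the paper's algorithm (QR-DQN) is a quantile regression loss: the network outputs N quantile locations of the return distribution, and each location is pulled toward its target quantile via an asymmetrically weighted Huber loss. The following is a minimal NumPy sketch of that loss under our own naming; the function names, array shapes, and the kappa = 1 default are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def huber(u, kappa=1.0):
        # Huber loss: quadratic for |u| <= kappa, linear beyond.
        return np.where(np.abs(u) <= kappa,
                        0.5 * u ** 2,
                        kappa * (np.abs(u) - 0.5 * kappa))

    def quantile_huber_loss(theta, targets, kappa=1.0):
        # theta: N predicted quantile locations of the return distribution.
        # targets: sampled target returns, e.g. Bellman targets from the
        # next-state distribution (hypothetical caller supplies these).
        N = theta.shape[0]
        tau_hat = (np.arange(N) + 0.5) / N             # midpoints (2i-1)/(2N)
        u = targets[None, :] - theta[:, None]          # pairwise TD errors
        weight = np.abs(tau_hat[:, None] - (u < 0.0))  # |tau - 1{u < 0}|
        # Sum over the N quantiles, average over target samples.
        return (weight * huber(u, kappa)).sum(axis=0).mean()

    # Example: pull 5 quantile estimates toward a sample of target returns.
    loss = quantile_huber_loss(np.zeros(5), np.array([1.0, 2.0, 3.0]))

Taking kappa toward 0 recovers plain quantile regression, whose minimizer at weight tau_hat_i is the tau_hat_i-quantile of the target distribution; the Huber smoothing keeps gradients well behaved near zero error.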

[1] Frederick R. Forst, et al. On robust estimation of the location parameter, 1980.

[2] Peter J. Bickel, et al. Some Asymptotic Theory for the Bootstrap, 1981.

[3] S. Goldstein, et al. On intrinsic randomness of dynamical systems, 1981.

[4] Gavin A. Rummery, et al. On-line Q-learning using connectionist systems, 1994.

[5] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[6] Matthias Heger. Consideration of Risk in Reinforcement Learning, 1994, ICML.

[7] John N. Tsitsiklis, et al. Analysis of Temporal-Difference Learning with Function Approximation, 1996, NIPS.

[8] A. Müller. Integral Probability Metrics and Their Generating Classes of Functions, 1997, Advances in Applied Probability.

[9] Richard Dearden, et al. Bayesian Q-Learning, 1998, AAAI/IAAI.

[10] James W. Taylor. A Quantile Regression Approach to Estimating the Distribution of Multiperiod Returns, 1999.

[11] Elizaveta Levina, et al. The Earth Mover's distance is the Mallows distance: some insights from statistics, 2001, ICCV.

[12] Christopher J. C. H. Watkins, et al. Q-learning, 1992, Machine Learning.

[13] Sean R. Eddy. What is dynamic programming?, 2004, Nature Biotechnology.

[14] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998.

[15] Yaakov Engel, et al. Reinforcement learning with Gaussian processes, 2005, ICML.

[16] Tetsuro Morimura, et al. Parametric Return Density Estimation for Reinforcement Learning, 2010, UAI.

[17] Ronny Luss, et al. Sparse Quantile Huber Regression for Efficient and Robust Estimation, 2014, arXiv.

[18] Diederik P. Kingma, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[19] Arun Nair, et al. Massively Parallel Methods for Deep Reinforcement Learning, 2015, arXiv.

[20] Volodymyr Mnih, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[21] Yinlam Chow, et al. Risk-Sensitive and Robust Decision-Making: a CVaR Optimization Approach, 2015, NIPS.

[22] Marc G. Bellemare, et al. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract), 2012, IJCAI.

[23] Hado van Hasselt, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.

[24] Ziyu Wang, et al. Dueling Network Architectures for Deep Reinforcement Learning, 2015, ICML.

[25] Tom Schaul, et al. Prioritized Experience Replay, 2015, ICLR.

[26] Martin Arjovsky, et al. Wasserstein Generative Adversarial Networks, 2017, ICML.

[27] Marc G. Bellemare, et al. A Distributional Perspective on Reinforcement Learning, 2017, ICML.

[28] Marc G. Bellemare, et al. The Cramer Distance as a Solution to Biased Wasserstein Gradients, 2017, arXiv.

[29] Roger Koenker, et al. Quantile regression, 2001, Journal of Economic Perspectives.