Nonparametric Return Distribution Approximation for Reinforcement Learning

Standard reinforcement learning (RL) optimizes decision-making rules with respect to the expected return. However, especially for risk-management purposes, other criteria such as the expected shortfall are sometimes preferred. Here, we describe a method for approximating the distribution of returns, from which various kinds of information about the returns can be derived. We first show that the Bellman equation, a recursive formula for the expected return, can be extended to the cumulative distribution of the return. Based on this extended Bellman equation, we then derive a nonparametric return distribution estimator that uses particle smoothing. A key feature of the proposed algorithm is that the recursion in the extended Bellman equation is realized by a simple procedure that replaces the particles associated with a state with those of its successor state. We show that our algorithm leads to a risk-sensitive RL paradigm, and we demonstrate its usefulness through numerical experiments.
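To make the particle-replacement idea concrete, the following Python sketch shows one way such an update could look. The function name particle_bellman_update, the replace_frac parameter, and the toy two-state chain are illustrative assumptions of ours, not the paper's algorithm; the sketch only mirrors the idea that some particles of a state are overwritten with the observed reward plus discounted particles drawn from the successor state, and that risk measures such as the expected shortfall can then be read directly off the particle set.

import numpy as np

def particle_bellman_update(particles, transitions, gamma=0.95, replace_frac=0.2, rng=None):
    # particles:   dict mapping each state to a 1-D NumPy array of return particles
    # transitions: iterable of (s, r, s_next) tuples observed under the evaluated policy
    # Hypothetical sketch: a random fraction of the particles at s is overwritten with
    # r + gamma * (a particle drawn from s_next), mimicking the replacement procedure
    # described above for the extended (distributional) Bellman recursion.
    rng = np.random.default_rng() if rng is None else rng
    for s, r, s_next in transitions:
        n = len(particles[s])
        k = max(1, int(replace_frac * n))                   # how many particles to refresh
        idx = rng.choice(n, size=k, replace=False)          # which particles to overwrite
        boot = rng.choice(particles[s_next], size=k)        # bootstrap from successor state
        particles[s][idx] = r + gamma * boot                # extended Bellman backup
    return particles

# Toy two-state chain (purely illustrative):
particles = {s: np.zeros(100) for s in ("s0", "s1")}
transitions = [("s0", 1.0, "s1"), ("s1", 0.0, "s0")] * 300
particles = particle_bellman_update(particles, transitions, gamma=0.9)

# Risk-sensitive quantities follow directly from the particle approximation:
returns = np.sort(particles["s0"])
expected_return = returns.mean()
alpha = 0.1                                                 # tail probability level
expected_shortfall = returns[: max(1, int(alpha * len(returns)))].mean()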
