Optimizing the CVaR via Sampling

Conditional Value at Risk (CVaR) is a prominent risk measure that is used extensively in various domains. We develop a new formula for the gradient of the CVaR in the form of a conditional expectation. Based on this formula, we propose a novel sampling-based estimator for the gradient of the CVaR, in the spirit of the likelihood-ratio method. We analyze the bias of the estimator and prove the convergence of a corresponding stochastic gradient descent algorithm to a local CVaR optimum. Our method makes it possible to consider CVaR optimization in new domains. As an example, we consider a reinforcement learning application and learn a risk-sensitive controller for the game of Tetris.
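
To make the sampling-based approach concrete, the sketch below illustrates a likelihood-ratio style CVaR gradient estimator combined with stochastic gradient ascent on a hypothetical toy problem (a Gaussian sampling distribution with a quadratic reward); the distribution, reward function, step size, and all names here are illustrative assumptions, not the paper's Tetris experiment or exact algorithm.

```python
# Minimal sketch (assumptions: X ~ N(theta, sigma^2), reward R = r(X), and we
# ascend the alpha-tail CVaR of R by stochastic gradient; toy setup only).
import numpy as np

rng = np.random.default_rng(0)

sigma = 1.0    # fixed std of the sampling distribution
alpha = 0.05   # CVaR level: worst alpha-fraction of rewards
N = 2000       # samples per gradient estimate
step = 0.5     # stochastic gradient step size


def reward(x):
    # Hypothetical reward with an optimum near x = 2.
    return -(x - 2.0) ** 2


def score(x, theta):
    # Likelihood-ratio (score function) of a Gaussian:
    # d/dtheta log N(x; theta, sigma^2).
    return (x - theta) / sigma ** 2


def cvar_gradient(theta):
    """Sampling-based estimate of d/dtheta CVaR_alpha(R): average the score
    weighted by (R - VaR) over the empirical alpha-tail of the rewards."""
    x = rng.normal(theta, sigma, size=N)
    r = reward(x)
    var_hat = np.quantile(r, alpha)      # empirical alpha-quantile (VaR)
    tail = r <= var_hat                  # worst alpha-fraction of samples
    grad = np.sum(score(x[tail], theta) * (r[tail] - var_hat)) / (alpha * N)
    cvar_hat = r[tail].mean()            # empirical CVaR of the reward
    return grad, cvar_hat


theta = -1.0
for t in range(200):
    g, cvar_hat = cvar_gradient(theta)
    theta += step * g                    # gradient ascent on the CVaR
    if t % 50 == 0:
        print(f"iter {t:3d}  theta={theta:.3f}  CVaR_hat={cvar_hat:.3f}")
```

In this sketch the empirical alpha-quantile plays the role of the VaR in the conditional-expectation gradient formula, and only the tail samples contribute to the update, which is the qualitative behavior the abstract describes.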
