Policy Gradient for Coherent Risk Measures

Several authors have recently developed risk-sensitive policy gradient methods that augment the standard expected-cost minimization problem with a measure of variability in cost. These studies have focused on specific risk measures, such as the variance or the conditional value at risk (CVaR). In this work, we extend the policy gradient method to the whole class of coherent risk measures, which is widely accepted in finance and operations research, among other fields. We consider both static and time-consistent dynamic risk measures. For static risk measures, our approach is in the spirit of policy gradient algorithms and combines a standard sampling approach with convex programming. For dynamic risk measures, our approach is actor-critic in style and involves an explicit approximation of the value function. Most importantly, our contribution presents a unified approach to risk-sensitive reinforcement learning that generalizes and extends previous results.
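
To make the static case concrete, below is a minimal sketch, not the paper's algorithm, of a sampled policy gradient for one coherent risk measure, the CVaR. It uses the Rockafellar-Uryasev representation CVaR_a(Z) = min_nu { nu + E[(Z - nu)_+] / (1 - a) } together with the likelihood-ratio gradient trick, with the nu-dependence handled by an envelope-theorem argument. The one-step Gaussian policy, the toy cost function sample_cost, and all hyperparameters are illustrative assumptions.

```python
# Hedged sketch of a sampled CVaR policy gradient (one-step Gaussian policy).
# Assumed, illustrative setup; not the authors' general coherent-risk method.
import numpy as np

rng = np.random.default_rng(0)
ALPHA = 0.95   # CVaR confidence level
N = 10_000     # cost samples per gradient step
LR = 0.05      # gradient step size

def sample_cost(actions):
    """Toy cost: quadratic loss plus noise (hypothetical environment)."""
    return (actions - 1.0) ** 2 + 0.5 * rng.standard_normal(actions.shape)

theta = np.array([0.0])  # mean of the Gaussian policy; std fixed at 1
for step in range(200):
    # Sample actions a ~ N(theta, 1) and their costs Z.
    a = theta + rng.standard_normal(N)
    z = sample_cost(a)
    # Empirical value-at-risk: nu* is the alpha-quantile of the cost samples.
    nu = np.quantile(z, ALPHA)
    # Score function of the Gaussian policy: d/dtheta log pi(a) = a - theta.
    score = a - theta
    # Likelihood-ratio estimate of grad CVaR: only excess costs (Z - nu)_+
    # contribute; the gradient through nu* vanishes by the envelope theorem.
    grad = np.mean(score * np.maximum(z - nu, 0.0)) / (1.0 - ALPHA)
    theta -= LR * grad  # descend, since we minimize risk of the cost

print("learned mean action:", theta)  # should approach the minimizer near 1.0
```

In this toy problem the CVaR-optimal policy mean coincides with the expected-cost minimizer by symmetry; with an asymmetric cost distribution the two would differ, which is the point of risk-sensitive optimization.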
