Practical Risk Measures in Reinforcement Learning

Practical applications of Reinforcement Learning (RL) often involve risk considerations. We study a generalized approximation scheme for risk measures, based on Monte-Carlo simulations, where the risk measures need not necessarily be \emph{coherent}. We demonstrate that, even in simple problems, measures such as the variance of the reward-to-go do not capture the risk in a satisfactory manner. In addition, we show how a risk measure can be derived from the model's realizations. We propose a neural architecture for estimating the risk, and suggest a risk-critic architecture that can be used to optimize a policy under general risk measures. We conclude with experiments that demonstrate the efficacy of our approach.
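To make the Monte-Carlo view concrete, the following is a minimal illustrative sketch, not the paper's architecture, of estimating a tail-risk measure of the reward-to-go from sampled trajectories, using CVaR as one concrete example of a risk measure. The function names and the toy return distribution are assumptions introduced purely for illustration.

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # Discounted reward-to-go of a single sampled trajectory.
    return sum(r * gamma ** t for t, r in enumerate(rewards))

def monte_carlo_cvar(returns, alpha=0.05):
    # Monte-Carlo CVaR estimate: the mean of the worst alpha-fraction
    # of the sampled returns (left tail of the return distribution).
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))
    return returns[:k].mean()

# Toy example: sample returns from a hypothetical policy and compare the
# mean return with its 5% left-tail CVaR.
rng = np.random.default_rng(0)
sampled_returns = [discounted_return(rng.normal(1.0, 2.0, size=50)) for _ in range(1000)]
print("mean return :", np.mean(sampled_returns))
print("CVaR (5%)   :", monte_carlo_cvar(sampled_returns, alpha=0.05))

The same sampling scheme extends to other risk measures (e.g., variance or percentile criteria) by replacing the tail-averaging step with the corresponding statistic of the sampled returns.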
