Model-Free Risk-Sensitive Reinforcement Learning

Risk-sensitivity, i.e. sensitivity to the higher-order moments of the return rather than to its expectation alone, is necessary for the real-world deployment of AI agents. Wrong assumptions, lack of data, model misspecification, limited computation, and adversarial attacks are just a handful of the countless sources of unforeseen perturbations that may be present at deployment time. Such perturbations can easily destabilize risk-neutral policies, which maximize expected return while entirely neglecting its variance. This poses serious safety concerns [Amodei et al., 2016, Leike et al., 2017, Russell et al., 2015].
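To make the trade-off concrete: a classical way to formalize risk-sensitivity is the exponential-utility objective J_β = (1/β) log E[exp(βR)], whose expansion around β = 0 is E[R] + (β/2) Var[R] + O(β²), so that β < 0 penalizes variance (risk-aversion), β > 0 rewards it (risk-seeking), and β → 0 recovers the risk-neutral objective. The following minimal Python sketch (illustrative only, not code from this work; the function name and the return distributions are assumptions made for the example) shows that two policies with identical mean return but different variance are indistinguishable under the risk-neutral criterion yet clearly separated under a risk-averse one.

    import numpy as np

    def exp_utility_objective(returns, beta):
        # Monte Carlo estimate of J_beta = (1/beta) * log E[exp(beta * R)].
        # The beta -> 0 limit is the ordinary risk-neutral objective E[R].
        if abs(beta) < 1e-12:
            return returns.mean()
        m = beta * returns
        # log-mean-exp, computed stably by shifting by the max before exponentiating
        return (np.log(np.mean(np.exp(m - m.max()))) + m.max()) / beta

    rng = np.random.default_rng(seed=0)
    safe = rng.normal(loc=1.0, scale=0.1, size=100_000)   # same mean, low variance
    risky = rng.normal(loc=1.0, scale=2.0, size=100_000)  # same mean, high variance
    for beta in (0.0, -1.0):
        print(beta, exp_utility_objective(safe, beta), exp_utility_objective(risky, beta))
    # beta = 0.0: both policies score ~1.0, so the risk-neutral view cannot tell them apart.
    # beta = -1.0: for Gaussian returns J_beta = mu + beta * sigma^2 / 2, so the risky
    # policy drops to roughly 1.0 - 4.0/2 = -1.0 while the safe one stays near 0.995.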

References

[1] Daniel A. Braun, et al. Thermodynamics as a theory of decision-making with information-processing costs, 2012, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[2] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[3] Stefan Schaal, et al. A Generalized Path Integral Control Approach to Reinforcement Learning, 2010, J. Mach. Learn. Res.

[4] Emanuel Todorov, et al. Linearly-solvable Markov decision problems, 2006, NIPS.

[5] Laurent El Ghaoui, et al. Robust Control of Markov Decision Processes with Uncertain Transition Matrices, 2005, Oper. Res.

[6] Ryota Tomioka, et al. Regularized Policies are Reward Robust, 2021, AISTATS.

[7] Vicenç Gómez, et al. Optimal control as a graphical model inference problem, 2009, Machine Learning.

[8] Samuel J. Gershman, et al. Do learning rates adapt to the distribution of rewards?, 2015, Psychonomic Bulletin & Review.

[9] Daniel Polani, et al. Information Theory of Decisions and Actions, 2011.

[10] Yoshua Bengio, et al. Incorporating Second-order Functional Knowledge for Better Option Pricing, 2000, NIPS.

[11] John N. Tsitsiklis, et al. Neuro-dynamic programming: an overview, 1995, Proceedings of the 34th IEEE Conference on Decision and Control.

[12] R. Bellman. Dynamic Programming, 1957, Princeton University Press.

[13] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[14] Stuart J. Russell, et al. Research Priorities for Robust and Beneficial Artificial Intelligence, 2015, AI Magazine.

[15] H. Markowitz. Portfolio Selection, 1952, The Journal of Finance.

[16] P. Dayan, et al. Neural Prediction Errors Reveal a Risk-Sensitive Reinforcement-Learning Process in the Human Brain, 2012, The Journal of Neuroscience.

[17] Daniel D. Lee, et al. An Adversarial Interpretation of Information-Theoretic Bounded Rationality, 2014, AAAI.

[18] Rémi Munos, et al. Recurrent Experience Replay in Distributed Reinforcement Learning, 2018, ICLR.

[19] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[20] R. Rescorla, et al. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement, 1972.

[21] Shie Mannor, et al. Scaling Up Robust MDPs using Function Approximation, 2014, ICML.

[22] Klaus Obermayer, et al. Risk-Sensitive Reinforcement Learning, 2013, Neural Computation.

[23] Michael I. Jordan, et al. MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning technical report, 1996.

[24] Sergey Levine, et al. If MaxEnt RL is the Answer, What is the Question?, 2019, arXiv.

[25] Daniel A. Braun, et al. Information, Utility and Bounded Rationality, 2011, AGI.

[26] Jordi Grau-Moya, et al. Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes, 2016, ECML/PKDD.

[27] Richard S. Sutton, et al. Time-Derivative Models of Pavlovian Reinforcement, 1990.

[28] Anind K. Dey, et al. Maximum Entropy Inverse Reinforcement Learning, 2008, AAAI.

[29] H. Kappen. Path integrals and symmetry breaking for optimal control theory, 2005, physics/0505066.

[30] Javier García, et al. A comprehensive survey on safe reinforcement learning, 2015, J. Mach. Learn. Res.

[31] Michèle Sebag, et al. Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits, 2013, ACML.

[32] Shie Mannor, et al. A General Approach to Multi-Armed Bandits Under Risk Criteria, 2018, COLT.

[33] Marc Toussaint, et al. Robot trajectory optimization using approximate inference, 2009, ICML.