Model-Free Risk-Sensitive Reinforcement Learning

Risk-sensitivity, i.e. sensitivity to the higher-order moments of the return rather than to its expectation alone, is necessary for the real-world deployment of AI agents. Wrong assumptions, lack of data, model misspecification, limited computation, and adversarial attacks are just a handful of the countless sources of unforeseen perturbations that may be present at deployment time. Such perturbations can easily destabilize risk-neutral policies, which maximize expected return while entirely neglecting its variance. This poses serious safety concerns [Amodei et al., 2016, Leike et al., 2017, Russell et al., 2015].
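To make the trade-off concrete: a classical way to formalize risk-sensitivity is the exponential-utility objective J_β = (1/β) log E[exp(βR)], whose expansion around β = 0 is E[R] + (β/2) Var[R] + O(β²), so that β < 0 penalizes variance (risk-aversion), β > 0 rewards it (risk-seeking), and β → 0 recovers the risk-neutral objective. The following minimal Python sketch (illustrative only, not code from this work; the function name and the return distributions are assumptions made for the example) shows that two policies with identical mean return but different variance are indistinguishable under the risk-neutral criterion yet clearly separated under a risk-averse one.

    import numpy as np

    def exp_utility_objective(returns, beta):
        # Monte Carlo estimate of J_beta = (1/beta) * log E[exp(beta * R)].
        # The beta -> 0 limit is the ordinary risk-neutral objective E[R].
        if abs(beta) < 1e-12:
            return returns.mean()
        m = beta * returns
        # log-mean-exp, computed stably by shifting by the max before exponentiating
        return (np.log(np.mean(np.exp(m - m.max()))) + m.max()) / beta

    rng = np.random.default_rng(seed=0)
    safe = rng.normal(loc=1.0, scale=0.1, size=100_000)   # same mean, low variance
    risky = rng.normal(loc=1.0, scale=2.0, size=100_000)  # same mean, high variance
    for beta in (0.0, -1.0):
        print(beta, exp_utility_objective(safe, beta), exp_utility_objective(risky, beta))
    # beta = 0.0: both policies score ~1.0, so the risk-neutral view cannot tell them apart.
    # beta = -1.0: for Gaussian returns J_beta = mu + beta * sigma^2 / 2, so the risky
    # policy drops to roughly 1.0 - 4.0/2 = -1.0 while the safe one stays near 0.995.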

References

[1] Daniel A. Braun, et al. Thermodynamics as a theory of decision-making with information-processing costs, 2012, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.

[2] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.

[3] Stefan Schaal, et al. A Generalized Path Integral Control Approach to Reinforcement Learning, 2010, J. Mach. Learn. Res.

[4] Emanuel Todorov, et al. Linearly-solvable Markov decision problems, 2006, NIPS.

[5] Laurent El Ghaoui, et al. Robust Control of Markov Decision Processes with Uncertain Transition Matrices, 2005, Oper. Res.

[6] Ryota Tomioka, et al. Regularized Policies are Reward Robust, 2021, AISTATS.

[7] Vicenç Gómez, et al. Optimal control as a graphical model inference problem, 2009, Machine Learning.

[8] Samuel J. Gershman, et al. Do learning rates adapt to the distribution of rewards?, 2015, Psychonomic Bulletin & Review.

[9] Daniel Polani, et al. Information Theory of Decisions and Actions, 2011.

[10] Yoshua Bengio, et al. Incorporating Second-order Functional Knowledge for Better Option Pricing, 2000, NIPS.

[11] John N. Tsitsiklis, et al. Neuro-dynamic programming: an overview, 1995, Proceedings of the 34th IEEE Conference on Decision and Control.

[12] R. Bellman. Dynamic Programming, 1957, Princeton University Press.

[13] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[14] Stuart J. Russell, et al. Research Priorities for Robust and Beneficial Artificial Intelligence, 2015, AI Magazine.

[15] H. Markowitz. Portfolio Selection, 1952, The Journal of Finance.

[16] P. Dayan, et al. Neural Prediction Errors Reveal a Risk-Sensitive Reinforcement-Learning Process in the Human Brain, 2012, The Journal of Neuroscience.

[17] Daniel D. Lee, et al. An Adversarial Interpretation of Information-Theoretic Bounded Rationality, 2014, AAAI.

[18] Rémi Munos, et al. Recurrent Experience Replay in Distributed Reinforcement Learning, 2018, ICLR.

[19] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, MIT Press.

[20] R. Rescorla, et al. A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement, 1972.

[21] Shie Mannor, et al. Scaling Up Robust MDPs using Function Approximation, 2014, ICML.

[22] Klaus Obermayer, et al. Risk-Sensitive Reinforcement Learning, 2013, Neural Computation.

[23] Michael I. Jordan, et al. MIT Artificial Intelligence Laboratory and Center for Biological and Computational Learning technical report, 1996.

[24] Sergey Levine, et al. If MaxEnt RL is the Answer, What is the Question?, 2019, arXiv.

[25] Daniel A. Braun, et al. Information, Utility and Bounded Rationality, 2011, AGI.

[26] Jordi Grau-Moya, et al. Planning with Information-Processing Constraints and Model Uncertainty in Markov Decision Processes, 2016, ECML/PKDD.

[27] Richard S. Sutton, et al. Time-Derivative Models of Pavlovian Reinforcement, 1990.

[28] Anind K. Dey, et al. Maximum Entropy Inverse Reinforcement Learning, 2008, AAAI.

[29] H. Kappen. Path integrals and symmetry breaking for optimal control theory, 2005, physics/0505066.

[30] Javier García, et al. A comprehensive survey on safe reinforcement learning, 2015, J. Mach. Learn. Res.

[31] Michèle Sebag, et al. Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits, 2013, ACML.

[32] Shie Mannor, et al. A General Approach to Multi-Armed Bandits Under Risk Criteria, 2018, COLT.

[33] Marc Toussaint, et al. Robot trajectory optimization using approximate inference, 2009, ICML.