Risk-Sensitive Reinforcement Learning

Most reinforcement learning algorithms optimize the expected return of a Markov decision problem. Practice has shown that this criterion is not always the most suitable, because many applications require robust control strategies that also take the variance of the return into account. The classical control literature offers several techniques for risk-sensitive optimization goals, such as the worst-case optimality criterion, which focuses exclusively on risk-avoiding policies, and classical risk-sensitive control, which transforms the returns by exponential utility functions. The first approach is typically too restrictive, while the latter lacks an obvious way to derive a corresponding model-free reinforcement learning algorithm.

Our risk-sensitive reinforcement learning algorithm is based on a very different philosophy: instead of transforming the return of the process, we transform the temporal differences during learning. This approach preserves important properties of the classical exponential utility framework while avoiding its serious drawbacks for learning. Based on an extended set of optimality equations, we formulate risk-sensitive versions of several well-known reinforcement learning algorithms that converge with probability one under the usual conditions.
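
As a rough illustration of the idea of transforming temporal differences rather than returns, the following minimal sketch shows tabular Q-learning in which the TD error is weighted asymmetrically by a risk parameter. The piecewise-linear transform, the parameter kappa, and the toy MDP used here are illustrative assumptions for this sketch, not definitions taken from the paper itself.

```python
import numpy as np

def transform_td(delta, kappa):
    """Weight positive and negative TD errors asymmetrically (assumed transform).

    kappa in (-1, 1): kappa > 0 penalizes negative surprises more heavily
    (risk-averse), kappa < 0 favors them (risk-seeking), kappa = 0 recovers
    ordinary risk-neutral Q-learning.
    """
    return (1.0 - kappa) * delta if delta > 0 else (1.0 + kappa) * delta

def risk_sensitive_q_learning(P, R, gamma=0.95, kappa=0.5, alpha=0.1,
                              episodes=5000, horizon=50, seed=0):
    """Tabular Q-learning with transformed TD errors.

    P[s, a] is a probability vector over next states and R[s, a] the mean
    immediate reward (a simple synthetic MDP format assumed here).
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))

    for _ in range(episodes):
        s = rng.integers(n_states)
        for _ in range(horizon):
            # epsilon-greedy behavior policy
            a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(Q[s]))
            s_next = rng.choice(n_states, p=P[s, a])
            r = R[s, a] + rng.normal(scale=0.5)          # noisy reward
            delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
            Q[s, a] += alpha * transform_td(delta, kappa)  # transformed TD update
            s = s_next
    return Q

if __name__ == "__main__":
    # Two-state, two-action toy MDP: in state 0, action 1 has a higher mean
    # reward but moves to the low-reward state 1 more often.
    P = np.array([[[0.9, 0.1], [0.3, 0.7]],
                  [[0.8, 0.2], [0.5, 0.5]]])
    R = np.array([[1.0, 1.5],
                  [0.2, -1.0]])
    Q_neutral = risk_sensitive_q_learning(P, R, kappa=0.0)
    Q_averse = risk_sensitive_q_learning(P, R, kappa=0.8)
    print("risk-neutral greedy policy:", Q_neutral.argmax(axis=1))
    print("risk-averse  greedy policy:", Q_averse.argmax(axis=1))
```

With kappa = 0 the update reduces to standard Q-learning, while increasing kappa makes negative TD errors dominate the update and pushes the greedy policy toward actions with less variable outcomes.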
