Paper information - Temporal Difference Methods for the Variance of the Reward To Go

In this paper we extend temporal difference policy evaluation algorithms to performance criteria that include the variance of the cumulative reward. Such criteria are useful for risk management, and are important in domains such as finance and process control. We propose variants of both TD(0) and LSTD(λ) with linear function approximation, prove their convergence, and demonstrate their utility in a 4-dimensional continuous state space problem.
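As a rough illustration of the idea behind a TD(0)-style variance estimator, the sketch below tracks two linear approximations: one for the expected reward to go, J, and one for its second moment, M. The variance is then recovered as M - J^2. It relies on the standard identity that, for an undiscounted episodic problem with total reward B = r + B', the second moment satisfies M(x) = E[r^2 + 2 r J(x') + M(x')]. This is a minimal sketch under stated assumptions and not necessarily the paper's exact algorithm: the function names, the single shared step size `alpha`, the zero feature vector for terminal states, and the one-state toy chain are all hypothetical choices made for illustration.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td0_variance_step(w_j, w_m, phi, phi_next, reward, alpha):
    """One sample-based TD(0) update for the value weights w_j and the
    second-moment weights w_m (assumed setup: undiscounted, episodic).

    phi_next must be the all-zero vector when the next state is terminal,
    so that both J and M evaluate to zero there.
    """
    j_next = dot(w_j, phi_next)
    m_next = dot(w_m, phi_next)
    # TD error for the expected reward to go: J(x) ~ r + J(x').
    delta_j = reward + j_next - dot(w_j, phi)
    # TD error for the second moment: M(x) ~ r^2 + 2 r J(x') + M(x').
    delta_m = reward ** 2 + 2.0 * reward * j_next + m_next - dot(w_m, phi)
    new_w_j = [w + alpha * delta_j * f for w, f in zip(w_j, phi)]
    new_w_m = [w + alpha * delta_m * f for w, f in zip(w_m, phi)]
    return new_w_j, new_w_m


def variance_estimate(w_j, w_m, phi):
    """Variance of the reward to go at features phi: M(x) - J(x)^2."""
    return dot(w_m, phi) - dot(w_j, phi) ** 2


def run_toy_chain(steps=20000, alpha=0.01, seed=0):
    """Hypothetical toy problem: a single state that pays reward 1 per step
    and terminates with probability 0.5, after which the episode restarts."""
    rng = random.Random(seed)
    w_j, w_m = [0.0], [0.0]
    phi = [1.0]
    for _ in range(steps):
        phi_next = [0.0] if rng.random() < 0.5 else [1.0]
        w_j, w_m = td0_variance_step(w_j, w_m, phi, phi_next, 1.0, alpha)
    return w_j, w_m


if __name__ == "__main__":
    w_j, w_m = run_toy_chain()
    print("J estimate:", w_j[0])
    print("variance estimate:", variance_estimate(w_j, w_m, [1.0]))
```

In the toy chain the reward to go is a geometric random variable with mean 2 and variance 2, so J = 2 and M = 6 form the fixed point of the expected updates. With a constant step size the sample-based estimates fluctuate around those values rather than settling exactly on them.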

Shie Mannor | Dotan Di Castro | Aviv Tamar
