Learning the Variance of the Reward-To-Go

In Markov decision processes (MDPs), the variance of the reward-to-go is a natural measure of uncertainty about the long-term performance of a policy, and is important in domains such as finance, resource allocation, and process control. Currently, however, there is no tractable procedure for computing it in large-scale MDPs. This stands in contrast to the expected reward-to-go, also known as the value function, for which effective simulation-based algorithms are known and have been used successfully in various domains. In this paper we extend temporal difference (TD) learning algorithms to the estimation of the variance of the reward-to-go for a fixed policy. We propose variants of both TD(0) and LSTD(λ) with linear function approximation, prove their convergence, and demonstrate their utility in an option pricing problem. Our results show a dramatic improvement in sample efficiency over standard Monte Carlo methods, which are currently the state of the art.
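To make the TD(0) variant concrete, recall that alongside the value function J(x) one can define the second moment M(x) of the reward-to-go, which satisfies a coupled Bellman-style equation, and recover the variance as V(x) = M(x) - J(x)^2. The sketch below is a minimal, hypothetical illustration of such a coupled TD(0) update with linear function approximation; the function names, data layout, and step sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def td0_variance_sketch(features, transitions, gamma=0.95, alpha=0.01):
    """Sketch: jointly estimate the reward-to-go J and its second moment M
    with linear function approximation; the variance estimate at state x is
    then features(x) @ w_M - (features(x) @ w_J) ** 2.

    `features(x)` is assumed to return a NumPy feature vector phi(x), and
    `transitions` is a non-empty list of observed (x, r, x_next) tuples
    sampled from the fixed policy being evaluated.
    """
    k = len(features(transitions[0][0]))
    w_J = np.zeros(k)  # weights for the value (first-moment) approximation
    w_M = np.zeros(k)  # weights for the second-moment approximation

    for x, r, x_next in transitions:
        phi, phi_next = features(x), features(x_next)
        J_next = phi_next @ w_J

        # TD error for the value, from J(x) = r + gamma * E[J(x')]
        delta_J = r + gamma * J_next - phi @ w_J
        # TD error for the second moment, from
        # M(x) = r^2 + 2 * gamma * r * E[J(x')] + gamma^2 * E[M(x')]
        delta_M = r**2 + 2 * gamma * r * J_next + gamma**2 * (phi_next @ w_M) - phi @ w_M

        w_J = w_J + alpha * delta_J * phi
        w_M = w_M + alpha * delta_M * phi

    return w_J, w_M
```

The key design point this sketch is meant to convey is that the second-moment update reuses the current value estimate J(x'), so the two linear approximations are learned together from the same stream of transitions rather than requiring separate Monte Carlo rollouts.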
