Policy Gradients with Variance Related Risk Criteria

Managing risk in dynamic decision problems is of cardinal importance in many fields such as finance and process control. The most common approach to defining risk is through variance-related criteria, such as the Sharpe Ratio or the standard-deviation-adjusted reward. It is known that optimizing many of these variance-related risk criteria is NP-hard. In this paper we devise a framework for local policy-gradient-style reinforcement learning algorithms for variance-related criteria. Our starting point is a new formula for the variance of the cost-to-go in episodic tasks. Using this formula, we develop policy gradient algorithms for criteria that involve both the expected cost and the variance of the cost. We prove the convergence of these algorithms to local minima and demonstrate their applicability in a portfolio planning problem.
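For concreteness, the variance-related criteria combine the expected episodic return J(theta) = E[R] and its variance V(theta) = Var(R): for example, maximizing the Sharpe Ratio J(theta)/sqrt(V(theta)), or the variance-penalized objective J(theta) - mu*V(theta) for a risk-aversion weight mu > 0. Below is a minimal sketch, not the paper's algorithm: a score-function (REINFORCE-style) Monte Carlo estimate of the gradient of J(theta) - mu*V(theta), using the identity grad Var(R) = grad E[R^2] - 2*J*grad J, applied to a two-armed bandit in which one arm has a higher mean but much higher variance. The bandit, the softmax policy, the reward-maximization convention, and all constants are illustrative assumptions.

```python
import numpy as np

# A minimal sketch (not the paper's algorithm): score-function (REINFORCE-style)
# gradient ascent on the variance-penalized objective J(theta) - mu * V(theta)
# for a two-armed Gaussian bandit. Arm 0: high mean, high variance; arm 1:
# lower mean, low variance. All names and parameter values are illustrative.

rng = np.random.default_rng(0)
MU = 1.0      # risk-aversion weight on the variance term (assumed)
ALPHA = 0.05  # step size (assumed)
BATCH = 500   # episodes per gradient estimate (assumed)

arm_means = np.array([1.0, 0.8])
arm_stds = np.array([2.0, 0.1])

theta = np.zeros(2)  # softmax policy parameters, one logit per arm

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

for it in range(200):
    probs = softmax(theta)
    arms = rng.choice(2, size=BATCH, p=probs)
    rewards = rng.normal(arm_means[arms], arm_stds[arms])

    # Score function of the softmax policy: d log pi(a) / d theta = onehot(a) - probs
    scores = np.eye(2)[arms] - probs  # shape (BATCH, 2)

    # Monte Carlo gradient estimates (J is estimated from the same batch,
    # which introduces a small bias that vanishes as BATCH grows).
    J = rewards.mean()
    grad_J = (rewards[:, None] * scores).mean(axis=0)
    grad_M = ((rewards ** 2)[:, None] * scores).mean(axis=0)  # grad of E[R^2]
    grad_V = grad_M - 2.0 * J * grad_J  # Var(R) = E[R^2] - J^2

    theta += ALPHA * (grad_J - MU * grad_V)  # ascend J - mu * Var

print("final policy:", softmax(theta))  # should favor the low-variance arm 1
```

With mu = 1 the risk-adjusted value of the high-variance arm (1.0 - 1*4.0 = -3.0) is far below that of the low-variance arm (0.8 - 1*0.01 = 0.79), so the learned policy should concentrate on arm 1 even though arm 0 has the higher expected reward.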
