Comparing Direct and Indirect Temporal-Difference Methods for Estimating the Variance of the Return

Temporal-difference (TD) learning methods are widely used in reinforcement learning to estimate the expected return for each state, without a model, because of their significant advantages in computational and data efficiency. For many applications involving risk mitigation, it would also be useful to estimate the variance of the return by TD methods. In this paper, we describe a way of doing this that is substantially simpler than those proposed by Tamar, Di Castro, and Mannor in 2012, or those proposed by White and White in 2016. We show that two TD learners operating in series can learn expectation and variance estimates. The trick is to use the square of the TD error of the expectation learner as the reward of the variance learner, and the square of the expectation learner’s discount rate as the discount rate of the variance learner. With these two modifications, the variance learning problem becomes a conventional TD learning problem to which standard theoretical results can be applied. Our formal results are limited to the table lookup case, for which our method is still novel, but the extension to function approximation is immediate, and we provide some empirical results for the linear function approximation case. Our experimental results show that our direct method behaves just as well as a comparable indirect method, but is generally more robust.
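To make the construction concrete, the following is a minimal tabular TD(0) sketch of the two learners operating in series; it is an illustration under our own assumptions, not code from the paper. The environment interface `env_step(s) -> (reward, next_state, done)` and the names `alpha_v`, `alpha_var`, and `start_state` are hypothetical and introduced only for this example. The first learner estimates the expected return; its squared TD error becomes the reward, and the square of its discount rate becomes the discount rate, for the second learner.

```python
import numpy as np

def direct_variance_td(env_step, num_states, num_episodes,
                       alpha_v=0.1, alpha_var=0.1, gamma=0.99, start_state=0):
    """Two tabular TD(0) learners in series (a sketch, not the paper's code).

    The first learner estimates the expected return V(s). The second learner
    estimates the variance of the return, using the first learner's squared
    TD error as its reward and gamma**2 as its discount rate.
    """
    V = np.zeros(num_states)   # expected-return estimates
    M = np.zeros(num_states)   # variance-of-return estimates

    for _ in range(num_episodes):
        s, done = start_state, False
        while not done:
            r, s_next, done = env_step(s)   # assumed environment interface

            # Expectation learner: ordinary TD(0) update.
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += alpha_v * delta

            # Variance learner: reward = delta**2, discount rate = gamma**2.
            delta_var = delta ** 2 + (0.0 if done else gamma ** 2 * M[s_next]) - M[s]
            M[s] += alpha_var * delta_var

            s = s_next

    return V, M
```

Because the variance learner receives an ordinary reward signal and discount rate, it is itself a conventional TD learner, which is why standard theoretical results carry over to it unchanged.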

[1] S. Shankar Sastry et al. Autonomous Helicopter Flight via Reinforcement Learning. NIPS, 2003.

[2] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv, 2012.

[3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[4] Patrick M. Pilarski et al. Horde: a scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. AAMAS, 2011.

[5] Mohammad Ghavamzadeh et al. Actor-Critic Algorithms for Risk-Sensitive MDPs. NIPS, 2013.

[6] Shalabh Bhatnagar et al. Fast gradient-descent methods for temporal-difference learning with linear function approximation. ICML, 2009.

[7] Yutaka Sakaguchi et al. Reliability of internal prediction/estimation and its application. I. Adaptive action selection reflecting reliability of value function. Neural Networks, 2004.

[8] Marc G. Bellemare et al. A Distributional Perspective on Reinforcement Learning. ICML, 2017.

[9] M. J. Sobel. The variance of discounted Markov decision processes. 1982.

[10] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Policy Gradients with Variance Related Risk Criteria. ICML, 2012.

[11] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Learning the Variance of the Reward-To-Go. J. Mach. Learn. Res., 2016.

[12] Aviv Tamar and Shie Mannor. Variance Adjusted Actor Critic Algorithms. arXiv, 2013.

[13] Makoto Sato et al. TD algorithm for the variance of return and mean-variance reinforcement learning. 2001.

[14] Christos Dimitrakakis. Ensembles for sequence learning. 2006.

[15] Martha White. Unifying Task Specification in Reinforcement Learning. ICML, 2017.

[16] Martha White and Adam White. A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning. AAMAS, 2016.

[17] R. Sutton et al. Gradient temporal-difference learning algorithms. 2011.