论文信息 - Optimality of LSTD and its Relation to MC

Optimality of LSTD and its Relation to MC

In this analytical study we compare the risk of the Monte Carlo (MC) and the least-squares TD (LSTD) estimator. We prove that for the case of acyclic Markov Reward Processes (MRPs) LSTD has minimal risk for any convex loss function in the class of unbiased estimators. When comparing the Monte Carlo estimator, which does not assume a Markov structure, and LSTD, we find that the Monte Carlo estimator is equivalent to LSTD if both estimators have the same amount of information. Theoretical results are supported by an empirical evaluation of the estimators.

[1] Richard S. Sutton,et al. Reinforcement learning with replacing eligibility traces , 2004, Machine Learning.

[2] John N. Tsitsiklis,et al. Bias and variance in value function estimation , 2004, ICML.

[3] Andrew W. Moore,et al. Learning evaluation functions for global optimization , 1998 .

[4] E. L. Lehmann,et al. Theory of point estimation , 1950 .

[5] Peter Dayan,et al. Analytical Mean Squared Error Curves for Temporal Difference Learning , 1996, Machine Learning.

[6] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.

[7] Sailes K. Sengijpta. Fundamentals of Statistical Signal Processing: Estimation Theory , 1995 .

[8] Andrew G. Barto,et al. Linear Least-Squares Algorithms for Temporal Difference Learning , 2005, Machine Learning.

[9] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[10] Michael Kearns,et al. Bias-Variance Error Bounds for Temporal Difference Updates , 2000, COLT.

[11] M. Kendall,et al. Kendall's advanced theory of statistics , 1995 .

[12] Michael I. Jordan,et al. MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES , 1996 .

[13] Justin A. Boyan,et al. Least-Squares Temporal Difference Learning , 1999, ICML.