Q($\lambda$) with Off-Policy Corrections
暂无分享,去创建一个
[1] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.
[2] Richard S. Sutton,et al. Off-policy learning based on weighted importance sampling with linear computational complexity , 2015, UAI.
[3] Shimon Whiteson,et al. A theoretical and empirical analysis of Expected Sarsa , 2009, 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
[4] H. Kahn,et al. Methods of Reducing Sample Size in Monte Carlo Computations , 1953, Oper. Res..
[5] Richard S. Sutton,et al. True online TD(λ) , 2014, ICML 2014.
[6] Preben Alstrøm,et al. Learning to Drive a Bicycle Using Reinforcement Learning and Shaping , 1998, ICML.
[7] Peter Dayan,et al. Analytical Mean Squared Error Curves for Temporal Difference Learning , 1996, Machine Learning.
[8] Michael Kearns,et al. Bias-Variance Error Bounds for Temporal Difference Updates , 2000, COLT.
[9] Richard S. Sutton,et al. Generalization in Reinforcement Learning: Successful Examples Using Sparse Coarse Coding , 1995, NIPS.
[10] R. Bellman. Dynamic programming. , 1957, Science.
[11] Martin L. Puterman,et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming , 1994 .
[12] Shie Mannor,et al. Generalized Emphatic Temporal Difference Learning: Bias-Variance Analysis , 2015, AAAI.
[13] Tommi S. Jaakkola,et al. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms , 2000, Machine Learning.
[14] Martha White,et al. Emphatic Temporal-Difference Learning , 2015, ArXiv.
[15] Jing Peng,et al. Incremental multi-step Q-learning , 1994, Machine Learning.
[16] Richard S. Sutton,et al. Learning to predict by the methods of temporal differences , 1988, Machine Learning.
[17] Hado Philip van Hasselt,et al. Insights in reinforcement rearning : formal analysis and empirical evaluation of temporal-difference learning algorithms , 2011 .