Conditional Importance Sampling for Off-Policy Learning

The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning based on conditional expectations of importance sampling ratios. The framework offers new perspectives on existing off-policy algorithms and reveals a broad space of unexplored ones. We analyse this space theoretically and concretely investigate several algorithms that arise from it.
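
As background for the framework, the sketch below illustrates standard per-decision importance sampling for off-policy policy evaluation: returns generated by a behaviour policy are reweighted by ratios of target and behaviour action probabilities. All names, the toy MDP, and the sample sizes are illustrative placeholders, not the paper's algorithms; the conditional approach studied in the paper can be thought of as replacing such raw per-step ratios with conditional expectations of them.

```python
# Minimal sketch (assumed setup): per-decision importance sampling for
# off-policy evaluation of a target policy pi from data generated by a
# behaviour policy mu. The tabular MDP and all names are placeholders.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, horizon, gamma = 5, 2, 10, 0.9

# Random tabular MDP: next-state distributions and per-(state, action) rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

# Behaviour policy mu (generates the data) and target policy pi (to evaluate).
mu = rng.dirichlet(np.ones(n_actions), size=n_states)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def is_estimate(n_episodes=10_000):
    """Per-decision importance sampling estimate of pi's value from mu's data."""
    total = 0.0
    for _ in range(n_episodes):
        s, rho, ret = 0, 1.0, 0.0
        for t in range(horizon):
            a = rng.choice(n_actions, p=mu[s])
            rho *= pi[s, a] / mu[s, a]           # cumulative importance ratio
            ret += (gamma ** t) * rho * R[s, a]  # per-decision correction
            s = rng.choice(n_states, p=P[s, a])
        total += ret
    return total / n_episodes

print("Off-policy estimate of pi's value:", is_estimate())
```

The cumulative ratio `rho` is a product of per-step ratios and can have very high variance over long horizons; reducing this variance by conditioning is the motivation for the conditional importance sampling framework analysed in the paper.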
