Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
[1] P. Holland. Statistics and Causal Inference , 1985 .
[2] J. Robins,et al. Semiparametric regression estimation in the presence of dependent censoring , 1995 .
[3] Richard S. Sutton,et al. Reinforcement Learning with Replacing Eligibility Traces , 2005, Machine Learning.
[4] John N. Tsitsiklis,et al. Neuro-Dynamic Programming , 1996, Encyclopedia of Machine Learning.
[5] Doina Precup,et al. Intra-Option Learning about Temporally Abstract Actions , 1998, ICML.
[6] Alex M. Andrew,et al. Reinforcement Learning: An Introduction , 1998 .
[7] G. Imbens,et al. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score , 2000 .
[8] J. Pearl. Causality: Models, Reasoning and Inference , 2000 .
[9] Doina Precup,et al. Eligibility Traces for Off-Policy Policy Evaluation , 2000, ICML.
[10] Doina Precup,et al. Temporal abstraction in reinforcement learning , 2000, ICML 2000.
[11] J M Robins,et al. Marginal Mean Models for Dynamic Regimes , 2001, Journal of the American Statistical Association.
[12] Sanjoy Dasgupta,et al. Off-Policy Temporal Difference Learning with Function Approximation , 2001, ICML.
[13] G. Imbens,et al. Efficient Estimation of Average Treatment Effects Using the Estimated Propensity Score , 2002 .
[14] John Langford,et al. Approximately Optimal Approximate Reinforcement Learning , 2002, ICML.
[15] Naoki Abe,et al. Sequential cost-sensitive decision making with reinforcement learning , 2002, KDD.
[16] Richard S. Sutton,et al. Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.
[17] Liming Xiang,et al. Kernel-Based Reinforcement Learning , 2006, ICIC.
[18] Csaba Szepesvári,et al. Bandit Based Monte-Carlo Planning , 2006, ECML.
[19] John N. Tsitsiklis,et al. Bias and Variance Approximation in Value Function Estimates , 2007, Manag. Sci.
[20] Peter Stone,et al. Model-based function approximation in reinforcement learning , 2007, AAMAS '07.
[21] T. Moore. A Theory of Cramer-Rao Bounds for Constrained Parametric Models , 2010 .
[22] Csaba Szepesvári,et al. Model Selection in Reinforcement Learning , 2011, Machine Learning.
[23] John Langford,et al. Doubly Robust Policy Evaluation and Learning , 2011, ICML.
[24] Wei Chu,et al. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms , 2010, WSDM '11.
[25] Guy Lever,et al. Modelling transition dynamics in MDPs with RKHS embeddings , 2012, ICML.
[26] Louis Wehenkel,et al. Batch mode reinforcement learning based on the synthesis of artificial trajectories , 2013, Ann. Oper. Res..
[27] Joaquin Quiñonero Candela,et al. Counterfactual reasoning and learning systems: the example of computational advertising , 2013, J. Mach. Learn. Res..
[28] Daniele Calandriello,et al. Safe Policy Iteration , 2013, ICML.
[29] Sergey Levine,et al. Offline policy evaluation across representations with applications to educational games , 2014, AAMAS.
[30] Jan Peters,et al. Policy evaluation with temporal differences: a survey and comparison , 2015, J. Mach. Learn. Res..
[31] Nan Jiang,et al. Abstraction Selection in Model-based Reinforcement Learning , 2015, ICML.
[32] Philip S. Thomas,et al. High Confidence Policy Improvement , 2015, ICML.
[33] Lihong Li,et al. Toward Minimax Off-policy Value Estimation , 2015, AISTATS.
[34] Vukosi Marivate,et al. Improved empirical methods in reinforcement-learning evaluation , 2015 .
[35] Jianfeng Gao,et al. Recurrent Reinforcement Learning: A Hybrid Approach , 2015, ArXiv.
[36] Philip S. Thomas,et al. High-Confidence Off-Policy Evaluation , 2015, AAAI.
[37] Philip S. Thomas,et al. Safe Reinforcement Learning , 2015 .
[38] Martha White,et al. An Emphatic Approach to the Problem of Off-policy Temporal-Difference Learning , 2015, J. Mach. Learn. Res..
[39] Philip S. Thomas,et al. Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning , 2016 .