Doubly Robust Off-policy Value Evaluation for Reinforcement Learning

We study the problem of off-policy value evaluation in reinforcement learning (RL), where one aims to estimate the value of a new policy based on data collected by a different policy. This problem is often a critical step when applying RL to real-world problems. Despite its importance, existing general methods either have uncontrolled bias or suffer high variance. In this work, we extend the doubly robust estimator for bandits to sequential decision-making problems, which gets the best of both worlds: it is guaranteed to be unbiased and can have a much lower variance than the popular importance sampling estimators. We demonstrate the estimator's accuracy in several benchmark problems, and illustrate its use as a subroutine in safe policy improvement. We also provide theoretical results on the hardness of the problem, and show that our estimator can match the lower bound in certain scenarios.
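To make the idea concrete, below is a minimal sketch of a step-wise doubly robust recursion for a single trajectory: at each step the estimate starts from an approximate value model and is corrected by an importance-weighted term built from the observed reward. The names (dr_estimate, pi_e, pi_b, q_hat) and the toy usage are illustrative assumptions, not the paper's code.

```python
def dr_estimate(trajectory, pi_e, pi_b, q_hat, actions, gamma=1.0):
    """Doubly robust value estimate for one trajectory (backward recursion).

    trajectory : list of (state, action, reward) tuples, in time order
    pi_e, pi_b : pi(s) -> dict mapping action -> probability
                 (evaluation and behavior policies)
    q_hat      : q_hat(s, a) -> approximate action value under pi_e
    actions    : iterable of all actions
    """
    v_dr = 0.0
    for (s, a, r) in reversed(trajectory):
        # Model-based state value: V_hat(s) = sum_a pi_e(a|s) * Q_hat(s, a)
        v_hat = sum(pi_e(s)[ap] * q_hat(s, ap) for ap in actions)
        rho = pi_e(s)[a] / pi_b(s)[a]  # per-step importance ratio
        # DR recursion: start from the model's estimate, then apply an
        # importance-weighted correction using the observed reward and
        # the already-computed estimate of the tail of the trajectory.
        v_dr = v_hat + rho * (r + gamma * v_dr - q_hat(s, a))
    return v_dr


# Toy usage (hypothetical): a single-state MDP with two actions and a crude Q model.
if __name__ == "__main__":
    acts = [0, 1]
    pi_b = lambda s: {0: 0.5, 1: 0.5}            # behavior: uniform
    pi_e = lambda s: {0: 0.9, 1: 0.1}            # evaluation: prefers action 0
    q_hat = lambda s, a: 1.0 if a == 0 else 0.0  # rough model of the return
    traj = [("s0", 0, 1.0), ("s0", 1, 0.0), ("s0", 0, 1.0)]
    print(dr_estimate(traj, pi_e, pi_b, q_hat, acts, gamma=0.9))
```

In this form, the estimate remains unbiased as long as the behavior policy's action probabilities are known, while a reasonably accurate q_hat shrinks the magnitude of the importance-weighted correction and hence the variance, which is the "best of both worlds" behavior described above.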
