Toward Minimax Off-policy Value Estimation

This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the single-state (multi-armed bandit) case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. We show that while the so-called regression estimator is asymptotically optimal, for small sample sizes its risk may exceed that of an ideal oracle by a multiplicative factor that depends on the number of actions. We also show that the other two popular estimators can be arbitrarily worse than optimal, even in the limit of infinitely many data points. The performance of the estimators is studied on synthetic and real problems, illustrating the methods' strengths and weaknesses. We also discuss the implications of these results for off-policy evaluation in contextual bandits and fixed-horizon Markov decision processes.
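For concreteness, the three standard estimators usually considered in this bandit setting can be sketched as follows; the notation below is illustrative and not taken from the paper itself. Given data (A_1, R_1), ..., (A_n, R_n) collected by a behavior policy \pi_b, the goal is to estimate the value v(\pi) = \sum_a \pi(a) r(a) of a target policy \pi, where r(a) = E[R | A = a]. The importance sampling (IS), weighted importance sampling (WIS), and regression estimators are

\[
  \hat v_{\mathrm{IS}} = \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(A_i)}{\pi_b(A_i)}\, R_i,
  \qquad
  \hat v_{\mathrm{WIS}} = \frac{\sum_{i=1}^{n} \frac{\pi(A_i)}{\pi_b(A_i)}\, R_i}{\sum_{i=1}^{n} \frac{\pi(A_i)}{\pi_b(A_i)}},
  \qquad
  \hat v_{\mathrm{REG}} = \sum_{a} \pi(a)\, \hat r(a),
\]

where \hat r(a) is the empirical mean reward over the rounds in which action a was taken (set, say, to 0 if a is never observed). The "other two popular estimators" mentioned in the abstract are presumably importance-sampling-style estimators of this form, although the abstract does not name them explicitly.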
