Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes

Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of $q$-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits of harnessing memorylessness.
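As a hedged illustration of the estimator structure described above, the Python sketch below shows how a cross-fitted, doubly robust OPE estimate might be assembled from held-out estimates of the marginalized density ratios and $q$-functions. The array names (`r`, `mu`, `q`, `v`), the finite-horizon undiscounted setup, and the exact score are illustrative assumptions for exposition, not the paper's precise formulation.

```python
# Minimal sketch (illustrative assumption, not the paper's exact estimator):
# combine cross-fitted nuisance estimates into a doubly robust OPE estimate
# for a finite-horizon MDP. For each fold, the nuisances are assumed to have
# been fitted on the *other* folds and then evaluated on that fold's data:
#   r  : (n, T)   observed rewards r_t
#   mu : (n, T)   marginalized density ratios mu_t(s_t, a_t)
#   q  : (n, T)   q-function estimates q_t(s_t, a_t) under the target policy
#   v  : (n, T+1) value estimates v_t(s_t) = E_{a ~ pi_e}[q_t(s_t, a)],
#                 with v[:, T] = 0 at the horizon
import numpy as np

def fold_scores(r, mu, q, v):
    # Per-trajectory doubly robust score:
    #   v_0(s_0) + sum_t mu_t(s_t, a_t) * (r_t + v_{t+1}(s_{t+1}) - q_t(s_t, a_t))
    td_error = r + v[:, 1:] - q
    return v[:, 0] + (mu * td_error).sum(axis=1)

def drl_estimate(folds):
    # Cross-fitting: pool the held-out scores across folds, then report the
    # average and a plug-in standard error.
    scores = np.concatenate([fold_scores(*f) for f in folds])
    return scores.mean(), scores.std(ddof=1) / np.sqrt(scores.size)
```

A score of this form stays unbiased if either the density-ratio estimates or the $q$-function estimates are consistent, which is the double robustness property referenced in the abstract.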
