Understanding the Curse of Horizon in Off-Policy Evaluation via Conditional Importance Sampling

Off-policy policy evaluation estimators that use importance sampling (IS) can suffer from high variance in long-horizon domains, and there has been particular excitement over new IS methods that leverage the structure of Markov decision processes. We analyze the variance of the most popular approaches through the lens of conditional Monte Carlo. Surprisingly, we find that in finite-horizon MDPs there is no strict variance reduction from per-decision importance sampling or stationary importance sampling compared with vanilla importance sampling. We then provide sufficient conditions under which the per-decision or stationary estimators provably reduce variance relative to importance sampling in the finite-horizon setting. For the asymptotic regime (as the horizon $T$ grows), we develop upper and lower bounds on the variance of these estimators, which yield sufficient conditions under which there is an exponential versus polynomial gap between the variance of importance sampling and that of the per-decision or stationary estimators. These results advance our understanding of whether and when new types of IS estimators improve the accuracy of off-policy estimation.
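For quick reference, the estimators being compared can be sketched as follows (notation ours: single-trajectory estimators with horizon $T$, discount $\gamma$, behavior policy $\mu$, and evaluation policy $\pi$; the paper's exact definitions and normalizations may differ):

$$\hat{v}_{\text{IS}} \;=\; \Big(\prod_{t=0}^{T-1}\frac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)}\Big)\sum_{t=0}^{T-1}\gamma^{t} r_t,
\qquad
\hat{v}_{\text{PDIS}} \;=\; \sum_{t=0}^{T-1}\gamma^{t}\Big(\prod_{k=0}^{t}\frac{\pi(a_k\mid s_k)}{\mu(a_k\mid s_k)}\Big) r_t,$$

$$\hat{v}_{\text{SIS}} \;=\; \sum_{t=0}^{T-1}\gamma^{t}\,\frac{d^{\pi}_{t}(s_t)}{d^{\mu}_{t}(s_t)}\,\frac{\pi(a_t\mid s_t)}{\mu(a_t\mid s_t)}\, r_t,$$

where $d^{\pi}_{t}$ and $d^{\mu}_{t}$ denote the marginal state distributions at time $t$ under $\pi$ and $\mu$ (replaced by stationary distributions in the infinite-horizon variant). The per-decision estimator weights each reward only by the likelihood ratio of the trajectory prefix up to that reward, and the stationary (marginalized) estimator replaces the cumulative action-probability ratio with a state-marginal ratio; neither modification by itself guarantees a strict variance reduction over $\hat{v}_{\text{IS}}$.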
