Off-Policy Evaluation in Partially Observable Environments

This work studies batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with the risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our approach on synthetic medical data.
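For context on the Importance Sampling baseline mentioned above, the sketch below is a minimal per-trajectory IS estimator of a target policy's value from batch data. The data format, function names, and recorded behavior probabilities are illustrative assumptions, not the paper's implementation. Note that in a partially observable setting the logged action probabilities are conditioned on observations rather than on the latent state, which is precisely the source of the bias the abstract warns about.

```python
import numpy as np

def importance_sampling_value(trajectories, pi_e, gamma=1.0):
    """Per-trajectory importance sampling estimate of the target policy's value.

    trajectories: list of episodes, each a list of (obs, action, reward, behavior_prob)
        tuples, where behavior_prob is the probability the logging policy assigned
        to the logged action given the observation (assumed to be recorded).
    pi_e: function (obs, action) -> probability of the action under the evaluation policy.
    gamma: discount factor.
    """
    estimates = []
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for t, (obs, action, reward, behavior_prob) in enumerate(episode):
            # The weight uses observation-conditioned probabilities; when the true
            # state is hidden, this ratio need not correct the distribution shift,
            # so the resulting estimate can be arbitrarily biased.
            weight *= pi_e(obs, action) / behavior_prob
            ret += (gamma ** t) * reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

This is the standard trajectory-wise IS estimator used here only as a point of comparison; the paper's estimators for POMDPs and Decoupled POMDPs, which correct for the unobserved state, are not reproduced in this sketch.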
