Off-Policy Evaluation in Partially Observable Environments

This work studies batch off-policy evaluation for Reinforcement Learning in partially observable environments. Off-policy evaluation under partial observability is inherently prone to bias, with the risk of arbitrarily large errors. We define the problem of off-policy evaluation for Partially Observable Markov Decision Processes (POMDPs) and establish what we believe is the first off-policy evaluation result for POMDPs. In addition, we formulate a model in which observed and unobserved variables are decoupled into two dynamic processes, called a Decoupled POMDP. We show how off-policy evaluation can be performed under this new model, mitigating estimation errors inherent to general POMDPs. We demonstrate the pitfalls of off-policy evaluation in POMDPs using a well-known off-policy method, Importance Sampling, and compare it with our approach on synthetic medical data.
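For context on the Importance Sampling baseline mentioned above, the sketch below is a minimal per-trajectory IS estimator of a target policy's value from batch data. The data format, function names, and recorded behavior probabilities are illustrative assumptions, not the paper's implementation. Note that in a partially observable setting the logged action probabilities are conditioned on observations rather than on the latent state, which is precisely the source of the bias the abstract warns about.

```python
import numpy as np

def importance_sampling_value(trajectories, pi_e, gamma=1.0):
    """Per-trajectory importance sampling estimate of the target policy's value.

    trajectories: list of episodes, each a list of (obs, action, reward, behavior_prob)
        tuples, where behavior_prob is the probability the logging policy assigned
        to the logged action given the observation (assumed to be recorded).
    pi_e: function (obs, action) -> probability of the action under the evaluation policy.
    gamma: discount factor.
    """
    estimates = []
    for episode in trajectories:
        weight, ret = 1.0, 0.0
        for t, (obs, action, reward, behavior_prob) in enumerate(episode):
            # The weight uses observation-conditioned probabilities; when the true
            # state is hidden, this ratio need not correct the distribution shift,
            # so the resulting estimate can be arbitrarily biased.
            weight *= pi_e(obs, action) / behavior_prob
            ret += (gamma ** t) * reward
        estimates.append(weight * ret)
    return float(np.mean(estimates))
```

This is the standard trajectory-wise IS estimator used here only as a point of comparison; the paper's estimators for POMDPs and Decoupled POMDPs, which correct for the unobserved state, are not reproduced in this sketch.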
