On Covariate Shift of Latent Confounders in Imitation and Reinforcement Learning

We consider the problem of using expert data with unobserved confounders for imitation and reinforcement learning. We begin by defining the problem of learning from confounded expert data in a contextual MDP setup. We analyze the limitations of learning from such data with and without external reward, and propose an adjustment of standard imitation learning algorithms to fit this setup. We then discuss the problem of distribution shift between the expert data and the online environment when the data is only partially observable. We prove possibility and impossibility results for imitation learning under arbitrary distribution shift of the missing covariates. When additional external reward is provided, we propose a sampling procedure that addresses the unknown shift and prove convergence to an optimal solution. Finally, we validate our claims empirically on challenging assistive healthcare and recommender system simulation tasks.
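
As an illustration of the setup sketched above, one possible formalization (our own notation for exposition, not taken from the paper itself: the context variable $z$, the distributions $P_{\text{data}}$ and $P_{\text{online}}$, and the expert policy $\pi^E$ are assumptions) is a contextual MDP whose context acts as a latent confounder:

$$\mathcal{M}_z = (\mathcal{S}, \mathcal{A}, P_z, r_z, \rho_0), \qquad z \sim P_{\text{data}}(z),$$
$$\mathcal{D} = \{(s_t, a_t)\}_{t \ge 0}, \qquad a_t \sim \pi^E(\cdot \mid s_t, z), \quad z \text{ unobserved in } \mathcal{D}.$$

Here the expert chooses actions using the context $z$, but $z$ is not recorded in the logged data, so the data are confounded. At deployment the learner faces contexts drawn from $P_{\text{online}}(z)$, and the covariate shift of the latent confounder referred to in the title is the case $P_{\text{online}}(z) \neq P_{\text{data}}(z)$.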
