Bandits with partially observable confounded data

We study linear contextual bandits with access to a large, confounded offline dataset sampled from a fixed logging policy. We show that this problem is closely related to a variant of the bandit problem with side information, and we construct a linear bandit algorithm that exploits the projected information in the offline data, proving regret bounds for it. In particular, our bounds improve on existing ones by a factor related to the visible dimensionality of the contexts in the data, indicating that confounded offline data can significantly accelerate online learning. Finally, we illustrate various characteristics of our approach through synthetic simulations.
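
To make the setup concrete, here is a minimal sketch of the kind of synthetic simulation described above. It is not the paper's algorithm: the observation model (offline contexts visible only through a fixed orthogonal projector `P`), the warm-start scheme, and all parameter names are illustrative assumptions. The idea it demonstrates is warm-starting a LinUCB-style ridge estimator with projected offline data, so the learner begins with reduced uncertainty on the visible subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T = 8, 5, 2000            # context dim, arms per round, online rounds
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)  # unknown reward parameter

# Assumed observation model: offline contexts are seen only through a
# rank-m orthogonal projector P (m < d), standing in for partial observability.
m = 4
U, _ = np.linalg.qr(rng.normal(size=(d, m)))
P = U @ U.T

# Offline dataset logged by a fixed policy; the learner sees P @ x, not x.
n_off = 5000
X_off = rng.normal(size=(n_off, d))
r_off = X_off @ theta + 0.1 * rng.normal(size=n_off)
X_vis = X_off @ P               # visible (projected) contexts

# Warm-start the ridge statistics with the visible offline data, so the
# initial estimate already carries information about theta on range(P).
lam, alpha = 1.0, 1.0
A = lam * np.eye(d) + X_vis.T @ X_vis
b = X_vis.T @ r_off

regret = 0.0
for t in range(T):
    arms = rng.normal(size=(k, d))          # candidate contexts this round
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    # LinUCB-style optimism: mean estimate plus a confidence-width bonus.
    bonus = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
    i = int(np.argmax(arms @ theta_hat + alpha * bonus))
    x = arms[i]
    r = x @ theta + 0.1 * rng.normal()      # observed reward
    regret += np.max(arms @ theta) - x @ theta
    A += np.outer(x, x)                     # online rank-one updates
    b += r * x

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```

Comparing this run against the same loop with `A = lam * np.eye(d)` and `b = 0` (no warm start) shows the qualitative effect the abstract describes: the projected offline data shrinks early exploration on the visible subspace, leaving the learner to explore mainly the unobserved directions.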
