A comparison of methods for model selection when estimating individual treatment effects

Practitioners in medicine, business, political science, and other fields increasingly recognize that decisions should be personalized to each patient, customer, or voter. A given treatment (e.g., a drug or advertisement) should be administered only to those who will respond most positively, and certainly not to those who will be harmed by it. Individual-level treatment effects can be estimated with tools adapted from machine learning, but different models can yield contradictory estimates. Unlike risk prediction models, however, treatment effect models cannot easily be compared on a held-out test set, because the true treatment effect is never directly observed. Besides outcome prediction accuracy, several metrics that use held-out data to evaluate treatment effect models have been proposed, but they are not widely used. We provide a didactic framework that elucidates the relationships between these approaches and compare them all using a variety of simulations of both randomized and observational data. Our results show that researchers estimating heterogeneous treatment effects need not limit themselves to a single model-fitting algorithm. Instead, multiple models fit by a diverse set of algorithms should be evaluated against each other using an objective function learned from the validation set. The model minimizing that objective should then be used to estimate the individual treatment effect for future individuals.
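
The selection procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the simulated randomized trial, the true effect tau(x) = 2x, the known treatment probability of 0.5, the two candidate models (a constant-effect model and a linear two-model "T-learner"), and the choice of validation objective. The objective used here is the transformed-outcome squared error, one of several possible held-out metrics; it relies on the fact that, under randomization with known probability p, the quantity Y(W - p) / (p(1 - p)) has conditional expectation equal to the treatment effect.

```python
import random

P_TREAT = 0.5  # known randomization probability (assumed for this sketch)

def simulate(n, seed):
    """Simulate a randomized trial; the true effect tau(x) = 2x is hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        w = 1 if rng.random() < P_TREAT else 0
        y = x + w * 2.0 * x + rng.gauss(0.0, 0.5)
        rows.append((x, w, y))
    return rows

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def fit_constant_effect(rows):
    """Candidate 1: the same effect for everyone (difference in arm means)."""
    treated = [y for _, w, y in rows if w == 1]
    control = [y for _, w, y in rows if w == 0]
    ate = sum(treated) / len(treated) - sum(control) / len(control)
    return lambda x: ate

def fit_t_learner_linear(rows):
    """Candidate 2: one linear outcome model per arm, effect = difference."""
    a1, b1 = fit_line([x for x, w, _ in rows if w == 1],
                      [y for _, w, y in rows if w == 1])
    a0, b0 = fit_line([x for x, w, _ in rows if w == 0],
                      [y for _, w, y in rows if w == 0])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)

def transformed_outcome_loss(tau_hat, rows):
    """Mean squared error between tau_hat(x) and the transformed outcome."""
    total = 0.0
    for x, w, y in rows:
        y_star = y * (w - P_TREAT) / (P_TREAT * (1.0 - P_TREAT))
        total += (y_star - tau_hat(x)) ** 2
    return total / len(rows)

train = simulate(2000, seed=1)       # used only to fit the candidates
validation = simulate(5000, seed=2)  # used only to compare them

candidates = {
    "constant_effect": fit_constant_effect(train),
    "t_learner_linear": fit_t_learner_linear(train),
}
losses = {name: transformed_outcome_loss(model, validation)
          for name, model in candidates.items()}
best = min(losses, key=losses.get)  # model minimizing the validation objective
print(losses, best)
```

The selected model, `candidates[best]`, would then be the one applied to future individuals. In observational data the treatment probability is not known and would itself have to be estimated, which this sketch does not attempt.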
