Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation

In many data mining applications, a sometimes considerable part of the data values is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle it in a rather ad hoc way or simply ignore the problem. We investigate simulation-based data augmentation for handling missing data, which fills in (imputes) one or more plausible values for each missing value. One advantage of this approach is that the imputation phase is separated from the analysis phase, so that different data mining algorithms can be applied to the completed data sets. We compare imputation with surrogate splits, as used in CART, for handling missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of the predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results, as it profits from variance reduction in much the same way as procedures such as bagging.
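The idea of fitting one model per completed data set and averaging the M predictions can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not the procedure used in the experiments: the imputation step is a simple random draw from the observed values of each column (a stand-in for simulation-based data augmentation), the model is a depth-1 tree (a stand-in for a full CART tree), and all function names and the toy data are hypothetical.

```python
import random

def impute_once(rows, rng):
    """Return one completed copy of rows; each None is replaced by a random
    draw from the observed values of its column (assumed stand-in imputer)."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[rng.choice(observed[j]) if v is None else v
             for j, v in enumerate(r)] for r in rows]

def fit_stump(X, y):
    """Depth-1 tree: choose the (feature, threshold) with the smallest squared
    error and store the mean of y on each side of the split."""
    mean_y = sum(y) / len(y)
    best = (float("inf"), None, None, mean_y, mean_y)  # fallback: no split
    for j in range(len(X[0])):
        for t in sorted(set(r[j] for r in X)):
            left = [yi for r, yi in zip(X, y) if r[j] <= t]
            right = [yi for r, yi in zip(X, y) if r[j] > t]
            if not left or not right:
                continue
            p_l, p_r = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((v - p_l) ** 2 for v in left)
                   + sum((v - p_r) ** 2 for v in right))
            if sse < best[0]:
                best = (sse, j, t, p_l, p_r)
    return best[1:]

def predict_stump(stump, row):
    """Predicted probability of class 1 for a complete row."""
    j, t, p_l, p_r = stump
    if j is None:
        return p_l
    return p_l if row[j] <= t else p_r

def fit_ensemble(rows, y, m=5, seed=0):
    """Fit one stump on each of m independently imputed data sets."""
    rng = random.Random(seed)
    return [fit_stump(impute_once(rows, rng), y) for _ in range(m)]

def predict_ensemble(stumps, row):
    """Average the predictions of the m models."""
    return sum(predict_stump(s, row) for s in stumps) / len(stumps)

# Toy data (hypothetical): class 1 iff the first feature is at least 10;
# two values of the first feature are missing.
rows = [[i, i % 2] for i in range(20)]
y = [1 if i >= 10 else 0 for i in range(20)]
rows[3][0] = None
rows[14][0] = None

stumps = fit_ensemble(rows, y, m=5)
print(predict_ensemble(stumps, [19, 0]), predict_ensemble(stumps, [0, 1]))
```

Because imputation happens before model fitting, `fit_stump` could be swapped for any other learner without touching `impute_once`, which mirrors the separation of the imputation and analysis phases described above.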