Paper information - Handling Missing Data in Trees: Surrogate Splits or Statistical Imputation

In many data mining applications, part of the data values, sometimes a considerable part, is missing. Despite the frequent occurrence of missing data, most data mining algorithms handle it in a rather ad hoc way or simply ignore the problem. We investigate simulation-based data augmentation to handle missing data, which fills in (imputes) one or more plausible values for each missing value. One advantage of this approach is that the imputation phase is separated from the analysis phase, so different data mining algorithms can be applied to the completed data sets. We compare imputation with surrogate splits, as used in CART, for handling missing data in tree-based mining algorithms. Experiments show that imputation tends to outperform surrogate splits in terms of the predictive accuracy of the resulting models. Averaging over M > 1 models resulting from M imputations yields even better results, because it profits from variance reduction in much the same way as procedures such as bagging.
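The idea of fitting one model per completed data set and averaging the M models can be illustrated with a minimal sketch. This is not the paper's method: the paper uses simulation-based data augmentation and full trees, whereas this sketch substitutes a simple random draw from the observed values of each column (hot-deck style) and a one-split tree (stump). All function names (`impute_once`, `fit_stump`, `fit_ensemble`, `predict_ensemble`) are hypothetical, and the sketch assumes numeric features, 0/1 labels, at least one observed value per column, and complete feature vectors at prediction time.

```python
import random
from statistics import mean


def impute_once(rows, rng):
    """Return a completed copy of rows: each None is replaced by a random
    draw from the observed values of the same column.
    Assumption: a stand-in for the paper's data augmentation step."""
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [[v if v is not None else rng.choice(observed[j])
             for j, v in enumerate(r)] for r in rows]


def fit_stump(X, y):
    """Fit a one-split tree by minimising the squared error of the two leaves.
    Returns (feature index, threshold, left-leaf mean, right-leaf mean)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(r[j] for r in X))
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for r, yi in zip(X, y) if r[j] <= t]
            right = [yi for r, yi in zip(X, y) if r[j] > t]
            ml, mr = mean(left), mean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # every feature is constant: predict the overall mean
        m = mean(y)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    j, t, ml, mr = stump
    return ml if x[j] <= t else mr


def fit_ensemble(X_missing, y, M=5, seed=0):
    """Create M completed data sets and fit one stump on each."""
    rng = random.Random(seed)
    return [fit_stump(impute_once(X_missing, rng), y) for _ in range(M)]


def predict_ensemble(stumps, x):
    """Average the M predictions (here: estimated probability of class 1)."""
    return mean(predict_stump(s, x) for s in stumps)
```

Because imputation happens before model fitting, `fit_stump` could be swapped for any other learner without touching `impute_once`, which mirrors the separation of imputation and analysis described in the abstract.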

A. J. Feelders

[1] J. L. Schafer et al. Multiple Imputation for Multivariate Missing-Data Problems: A Data Analyst's Perspective, 1998, Multivariate Behavioral Research.

[2] D. Rubin. Multiple Imputation After 18+ Years, 1996.

[3] J. Ross Quinlan et al. C4.5: Programs for Machine Learning, 1992.

[4] Leo Breiman et al. Bagging Predictors, 1996, Machine Learning.

[5] Geoffrey J. McLachlan et al. Mining in the Presence of Selectivity Bias and its Application to Reject Inference, 1998, KDD.

[6] David E. Booth et al. Analysis of Incomplete Multivariate Data, 2000, Technometrics.

[7] Catherine Blake et al. UCI Repository of machine learning databases, 1998.

[8] Alberto Maria Segre et al. Programs for Machine Learning, 1994.