A feature weighted penalty based dissimilarity measure for k-nearest neighbor classification with missing features

A kNN-FWPD classifier is proposed, with FWPD as the underlying dissimilarity measure.
The kNN-FWPD classifier can be applied directly to datasets with missing features.
The proposed classifier has time complexity similar to that of the kNN classifier.
Experiments are conducted on four types of missingness: MCAR, MAR, MNAR1, and MNAR2.
kNN-FWPD is found to outperform ZI, AI, and kNNI in terms of classification accuracy.

The k-Nearest Neighbor (kNN) classifier is an elegant learning algorithm, widely used because it is simple and non-parametric. However, like most learning algorithms, kNN cannot be applied directly to data with missing features. We adopt the philosophy of a Penalized Dissimilarity Measure (PDM) and incorporate a PDM called the Feature Weighted Penalty based Dissimilarity (FWPD) into kNN. The resulting kNN-FWPD classifier can be applied directly to datasets with missing features, without any preprocessing such as marginalization or imputation. Extensive experiments on simulations of four different missing feature mechanisms, using various datasets, suggest that the proposed method handles the missing feature problem much more effectively than some of the popular imputation methods used in conjunction with kNN.
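The idea of running kNN on a penalized dissimilarity can be illustrated with a minimal sketch. The abstract does not give the FWPD formula, so everything specific here is an assumption made for illustration: the dissimilarity is taken as a convex combination (hypothetical parameter `alpha`) of a Euclidean distance over the features observed in both points, scaled by the largest such distance in the training set, and a penalty equal to the weighted fraction of features missing from at least one of the two points, where each feature's weight is the number of training points in which it is observed. Missing values are encoded as `None`, and the function names are hypothetical.

```python
import math
from collections import Counter


def observed_distance(a, b):
    """Euclidean distance over features observed (not None) in both points,
    plus the list of those commonly observed feature indices."""
    common = [l for l, (u, v) in enumerate(zip(a, b))
              if u is not None and v is not None]
    dist = math.sqrt(sum((a[l] - b[l]) ** 2 for l in common))
    return dist, common


def penalized_dissimilarity(a, b, weights, d_max, alpha=0.5):
    """Assumed form: (1 - alpha) * scaled distance + alpha * weighted penalty."""
    dist, common = observed_distance(a, b)
    common = set(common)
    total_w = sum(weights)
    # Penalty: weight share of features missing from at least one point.
    missing_w = sum(w for l, w in enumerate(weights) if l not in common)
    penalty = missing_w / total_w if total_w > 0 else 0.0
    scaled = dist / d_max if d_max > 0 else 0.0
    return (1 - alpha) * scaled + alpha * penalty


def knn_predict(train_X, train_y, query, k=3, alpha=0.5):
    """Majority vote among the k training points closest to the query
    under the penalized dissimilarity; no imputation is performed."""
    n_features = len(query)
    # Feature weight: number of training points where the feature is observed.
    weights = [sum(x[l] is not None for x in train_X)
               for l in range(n_features)]
    # Largest observed-feature distance between any two training points
    # (quadratic in the training set size; fine for a small illustration).
    d_max = max((observed_distance(train_X[i], train_X[j])[0]
                 for i in range(len(train_X))
                 for j in range(i + 1, len(train_X))), default=0.0)
    ranked = sorted(
        range(len(train_X)),
        key=lambda i: penalized_dissimilarity(
            train_X[i], query, weights, d_max, alpha))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data with missing entries in both the training set and the query.
train_X = [[1.0, 2.0], [1.2, None], [8.0, 9.0], [None, 8.5]]
train_y = ["a", "a", "b", "b"]
print(knn_predict(train_X, train_y, [1.1, None], k=3))
```

In this sketch a pair of points that share no observed feature receives distance zero but the maximum penalty, so such a pair is not treated as a close match merely because nothing could be compared. How the actual FWPD balances the two terms should be taken from the paper itself.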
