Feature selection for clustering - a filter solution

Processing applications with a large number of dimensions has long been a challenge for the KDD community. Feature selection, an effective dimensionality reduction technique, is an essential pre-processing step for removing noisy features. Only a few feature selection methods for clustering have been proposed in the literature, and almost all of them are 'wrapper' techniques that require a clustering algorithm to evaluate candidate feature subsets. The wrapper approach is largely unsuitable for real-world applications for two reasons: it relies heavily on clustering algorithms that require parameters such as the number of clusters, and there are no suitable criteria for comparing clusterings across different subspaces. In this paper we propose a 'filter' method that is independent of any clustering algorithm. The method is based on the observation that data with clusters has a very different point-to-point distance histogram from that of data without clusters. Exploiting this, we propose an entropy measure that is low if the data has distinct clusters and high if it does not. The measure is suitable for selecting the most important subset of features because it is invariant to the number of dimensions and is affected only by the quality of the clustering. Extensive performance evaluation over synthetic, benchmark, and real datasets shows its effectiveness.
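
The abstract does not state the entropy formula, so the following is only a minimal sketch of the underlying idea, not the paper's measure. It assumes a plain Shannon entropy over a histogram of normalized pairwise Euclidean distances; the function name `distance_entropy`, the bin count of 20, and the toy data are all illustrative choices. Unlike the measure described above, this simplified score is not claimed to be invariant to the number of dimensions.

```python
import numpy as np


def distance_entropy(X, bins=20):
    """Shannon entropy (in nats) of the normalized pairwise-distance histogram.

    Illustrative stand-in for the paper's measure: low when distances pile up
    in a few bins (tight, well-separated clusters), high when they are spread.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a single feature as an (n, 1) matrix
    # Euclidean distance between every pair of points.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (strict upper triangle).
    d = dist[np.triu_indices(len(X), k=1)]
    if d.size == 0 or d.max() == 0.0:
        return 0.0  # fewer than two points, or all points identical
    d = d / d.max()  # scale to [0, 1] so subsets are comparable
    counts, _ = np.histogram(d, bins=bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())


# Toy data (assumed for illustration): feature 0 separates two tight groups,
# feature 1 is uniform noise with no cluster structure.
rng = np.random.default_rng(0)
informative = np.concatenate([rng.normal(0.0, 0.02, 50),
                              rng.normal(1.0, 0.02, 50)])
noise = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([informative, noise])

# Score each candidate subset without running any clustering algorithm.
for name, cols in [("informative", [0]), ("noise", [1]), ("both", [0, 1])]:
    print(name, round(distance_entropy(X[:, cols]), 3))
```

In this toy setup the informative feature should score lower than the noise feature, because its distances concentrate near 0 (within a group) and near 1 (between groups). A filter-style selection would then prefer subsets with lower scores.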
