An effective evolutionary algorithm for discrete-valued data clustering

Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains, and much of that effort has gone into effective algorithms for handling spatial data. These algorithms were originally developed to handle continuous-valued attributes, and distance functions such as the Euclidean distance are often used to measure the pairwise similarity or distance between records in order to determine their cluster memberships. Since such distance functions cannot be validly defined in non-Euclidean space, these algorithms cannot be used to handle databases that contain discrete-valued data. Because data in real-life databases are always described by a set of descriptive attributes, many of which are not numerical or inherently ordered in any way, it is important that a clustering algorithm be developed to handle data mining tasks involving them. In this paper, we propose an effective evolutionary clustering algorithm for this problem. For performance evaluation, we have tested the proposed algorithm on several real data sets. Experimental results show that it outperforms the existing algorithms commonly used for discrete-valued data clustering, and that its performance is also promising when dealing with mixed continuous- and discrete-valued data.
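To illustrate why discrete-valued records call for something other than Euclidean distance, the following is a minimal sketch of a simple-matching dissimilarity, which counts the attributes on which two records disagree. It is not the algorithm proposed in this paper; the function name and the sample records are assumptions introduced purely for illustration.

```python
from typing import Sequence


def simple_matching_dissimilarity(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attributes on which two categorical records disagree.

    Illustrative helper only (hypothetical name); it needs no ordering or
    arithmetic on attribute values, only an equality test.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical records described by three unordered, non-numerical attributes.
r1 = ("red", "round", "small")
r2 = ("red", "square", "small")
r3 = ("blue", "square", "large")

d12 = simple_matching_dissimilarity(r1, r2)  # records differ on one attribute
d13 = simple_matching_dissimilarity(r1, r3)  # records differ on all three
print(d12, d13)
```

A measure of this kind only tests attribute values for equality, so it remains well defined when values such as colour or shape have no numerical meaning or natural order.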
