Document Clustering Using Differential Evolution

This paper investigates a novel approach to partitional clustering of large collections of text documents using an improved version of the classical differential evolution (DE) algorithm. Fast and accurate document clustering plays an important role in text mining and automatic information retrieval systems. The k-means algorithm has been the most widely used partitional clustering algorithm for text documents. However, in most cases it provides only locally optimal solutions. In this work, the clustering problem is formulated as an optimization task and solved using a modified DE algorithm. To reduce computational time, a hybrid method combining k-means with DE is also proposed. The new algorithms were tested on a number of document datasets. Comparison with k-means, a state-of-the-art PSO-based method, and a recently proposed real-coded GA-based text clustering method indicates the superiority of the proposed techniques in terms of both speed and clustering quality.
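To illustrate what formulating clustering as an optimization task can look like, the following is a minimal sketch rather than the method of this paper. It uses the classical DE/rand/1/bin scheme, not the modified DE proposed here, and every specific in it is an assumption made for illustration: each candidate solution encodes k centroids as one flat vector, the objective is the sum of squared Euclidean distances from each document vector to its nearest centroid, and the function names (`sse`, `de_cluster`) and parameter values (`f=0.5`, `cr=0.9`) are hypothetical.

```python
import random


def sse(candidate, docs, k):
    """Assumed objective: sum of squared distances to the nearest centroid."""
    dim = len(docs[0])
    centroids = [candidate[c * dim:(c + 1) * dim] for c in range(k)]
    total = 0.0
    for doc in docs:
        total += min(
            sum((x - y) ** 2 for x, y in zip(doc, cen)) for cen in centroids
        )
    return total


def de_cluster(docs, k, pop_size=20, generations=100, f=0.5, cr=0.9, seed=0):
    """Classical DE/rand/1/bin over flattened centroid vectors (illustrative)."""
    rng = random.Random(seed)
    n = k * len(docs[0])

    # Initialise each individual with k randomly chosen documents as centroids.
    pop = []
    for _ in range(pop_size):
        individual = []
        for doc in rng.sample(docs, k):
            individual.extend(doc)
        pop.append(individual)
    fit = [sse(ind, docs, k) for ind in pop]

    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from the target i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(n)  # guarantees at least one mutated gene
            trial = [
                pop[a][j] + f * (pop[b][j] - pop[c][j])
                if (rng.random() < cr or j == j_rand)
                else pop[i][j]
                for j in range(n)
            ]
            trial_fit = sse(trial, docs, k)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_fit <= fit[i]:
                pop[i], fit[i] = trial, trial_fit

    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


# Toy usage: four 2-D "document vectors" forming two obvious groups.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
best, score = de_cluster(docs, k=2)
```

Because selection is greedy, the best objective value in the population never increases from one generation to the next. A hybrid in the spirit of the abstract could seed part of the initial population with k-means output, but the details of that hybrid are not specified here and are not reproduced in this sketch.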
