An evolutionary clustering algorithm for gene expression microarray data analysis

Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically very noisy, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome so that each gene in the chromosome encodes one cluster. Based on this encoding scheme, it uses a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that EvoCluster adopts can distinguish how relevant each feature value is in determining a particular cluster grouping. As such, instead of considering only local pairwise distances, it also takes into account how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. In addition, the patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation, even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Moreover, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.
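The encoding idea described above can be illustrated with a minimal Python sketch. It assumes that each gene is stored as a set of record indices and that grouping information is exchanged by copying one whole cluster from a donor chromosome into a receiver. The names `random_chromosome`, `exchange_cluster`, and `is_partition` are hypothetical, and the operator shown is a generic grouping-style exchange chosen for illustration; the abstract does not specify EvoCluster's actual reproduction operators or its fitness function.

```python
import random
from typing import List, Set

# Assumed representation: a chromosome is a list of genes,
# and each gene is one cluster, stored as a set of record indices.
Chromosome = List[Set[int]]


def random_chromosome(n_records: int, max_clusters: int,
                      rng: random.Random) -> Chromosome:
    """Build a random grouping; the number of clusters is not fixed in advance."""
    k = rng.randint(2, max_clusters)
    genes: Chromosome = [set() for _ in range(k)]
    for record in range(n_records):
        genes[rng.randrange(k)].add(record)
    return [g for g in genes if g]  # drop clusters that stayed empty


def exchange_cluster(receiver: Chromosome, donor: Chromosome,
                     rng: random.Random) -> Chromosome:
    """Hypothetical operator: inject one whole donor cluster into the receiver."""
    injected = set(rng.choice(donor))
    # Remove the injected records from the receiver's clusters,
    # discard any cluster that becomes empty, then add the donor cluster.
    child = [g - injected for g in receiver]
    child = [g for g in child if g]
    child.append(injected)
    return child


def is_partition(chromosome: Chromosome, n_records: int) -> bool:
    """Check that every record belongs to exactly one cluster."""
    seen = sorted(r for g in chromosome for r in g)
    return seen == list(range(n_records))
```

Because a cluster is transferred as a unit, the child always remains a valid partition of the records, while its number of clusters may differ from that of either parent.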
