Gene Selection for Cancer Classification using Support Vector Machines

DNA micro-arrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new micro-array devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues.

In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA micro-arrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer.

In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leave-one-out error, while 64 genes are necessary for the baseline method to get the best result (one leave-one-out error). In the colon cancer database, using only 4 genes our method is 98% accurate, while the baseline method is only 86% accurate.
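
The abstract names the method without spelling out its loop, so the following is only a minimal sketch of recursive feature elimination under stated assumptions: a linear, bias-free SVM; the squared-weight ranking criterion usually associated with SVM-RFE; and one feature removed per iteration. The small Pegasos-style trainer is a stand-in chosen to keep the example dependency-free, not the solver used in the paper. The names `train_linear_svm`, `svm_rfe` and `make_toy_data` are hypothetical, and the data are synthetic rather than micro-array measurements.

```python
import random


def train_linear_svm(X, y, features, lam=0.01, epochs=50, seed=0):
    """Fit a bias-free linear SVM on the chosen feature columns.

    Assumption: Pegasos-style stochastic subgradient descent on the
    L2-regularized hinge loss. Returns one weight per entry of `features`.
    """
    rng = random.Random(seed)
    w = [0.0] * len(features)
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            xi = [X[i][f] for f in features]
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, xi))
            # Regularization step: shrink every weight.
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Move toward the example only if it violates the margin.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, xi)]
    return w


def svm_rfe(X, y, n_keep=1):
    """Retrain on the surviving features, drop the one with smallest w^2,
    and repeat until only `n_keep` features remain."""
    surviving = list(range(len(X[0])))
    eliminated = []  # earliest-eliminated (lowest-ranked) feature first
    while len(surviving) > n_keep:
        w = train_linear_svm(X, y, surviving)
        worst = min(range(len(surviving)), key=lambda j: w[j] ** 2)
        eliminated.append(surviving.pop(worst))
    return surviving, eliminated


def make_toy_data(n_samples=20, n_features=5, informative=2, seed=1):
    """Synthetic stand-in for expression data: every column is small noise,
    and one column additionally carries the class label."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 2 == 0 else -1
        row = [rng.uniform(-0.3, 0.3) for _ in range(n_features)]
        row[informative] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_data()
selected, eliminated = svm_rfe(X, y, n_keep=1)
print("selected feature:", selected)
print("elimination order:", eliminated)
```

On this toy problem the label-carrying column is the one that survives, because its weight dominates the noise columns in every round. Retraining after each removal is what separates this procedure from a one-shot ranking: weights are re-estimated on the reduced feature set, so a feature's rank can change once other features are gone.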
