Proteomic mass spectra classification using decision tree based ensemble methods

MOTIVATION Modern mass spectrometry allows the determination of proteomic fingerprints of body fluids like serum, saliva or urine. These measurements can be used in many medical applications in order to diagnose the current state or predict the evolution of a disease. Recent developments in machine learning allow one to exploit such datasets, characterized by small numbers of very high-dimensional samples. RESULTS We propose a systematic approach based on decision tree ensemble methods, which is used to automatically determine proteomic biomarkers and predictive models. The approach is validated on two datasets of surface-enhanced laser desorption/ionization time of flight measurements, for the diagnosis of rheumatoid arthritis and inflammatory bowel diseases. The results suggest that the methodology can handle a broad class of similar problems.

[1]  G. Izmirlian,et al.  Application of the Random Forest Classification Algorithm to a SELDI‐TOF Proteomics Study in the Setting of a Cancer Prevention Trial , 2004, Annals of the New York Academy of Sciences.

[2]  Elena Marchiori,et al.  Analysis of Proteomic Pattern Data for Cancer Detection , 2004, EvoWorkshops.

[3]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[4]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.

[5]  Geoffrey J McLachlan,et al.  Selection bias in gene extraction on the basis of microarray gene-expression data , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[6]  Christopher M. Bishop,et al.  Classification and regression , 1997 .

[7]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[8]  Thomas G. Dietterich An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization , 2000, Machine Learning.

[9]  P. Schellhammer,et al.  Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. , 2002, Clinical chemistry.

[10]  Trevor Hastie,et al.  The Elements of Statistical Learning , 2001 .

[11]  D. Ruppert The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[12]  Eric Bauer,et al.  An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants , 1999, Machine Learning.

[13]  M. Kostrzewa,et al.  Mass spectrometry-based clinical proteomics. , 2003, Pharmacogenomics.

[14]  Huiqing Liu,et al.  Discovery of significant rules for classifying cancer diagnosis data , 2003, ECCB.

[15]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[16]  Huiqing Liu,et al.  A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. , 2002, Genome informatics. International Conference on Genome Informatics.

[17]  Louis Wehenkel,et al.  Automatic Learning Techniques in Power Systems , 1997 .

[18]  David Ward,et al.  Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data , 2003, Bioinform..

[19]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[20]  Leo Breiman,et al.  Bagging Predictors , 1996, Machine Learning.

[21]  E. Fung,et al.  ProteinChip clinical proteomics: computational challenges and solutions. , 2002, BioTechniques.

[22]  E. Fung,et al.  Proteomic approaches to tumor marker discovery. , 2002, Archives of pathology & laboratory medicine.

[23]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.