Supervised learning with decision tree-based methods in computational and systems biology.

At the intersection of artificial intelligence and statistics, supervised learning allows algorithms to build predictive models automatically from observations of a system alone. Over the last twenty years, supervised learning has been a tool of choice for analyzing the ever-growing and increasingly complex data generated in molecular biology, with successful applications in genome annotation, function prediction, and biomarker discovery. Among supervised learning methods, decision tree-based methods stand out as nonparametric methods that have the unique feature of combining interpretability, efficiency, and, when used in ensembles of trees, excellent accuracy. The goal of this paper is to provide an accessible and comprehensive introduction to this class of methods. The first part of the review gives an intuitive but complete description of decision tree-based methods and discusses their strengths and limitations relative to other supervised learning methods. The second part surveys their applications in computational and systems biology.
