Machine learning: an indispensable tool in bioinformatics.

The growth in the number and complexity of biological databases has created a need for modern, powerful data analysis tools and techniques. To meet this need, machine learning has become an everyday tool in biological laboratories, and its techniques are now applied across a wide spectrum of bioinformatics applications. Machine learning is broadly used to investigate the underlying mechanisms of, and interactions between, biological molecules in many diseases, and it is an essential tool in any biomarker discovery process. In this chapter, we provide a basic taxonomy of machine learning algorithms and describe the characteristics of the main data preprocessing, supervised classification, and clustering techniques. We also present feature selection, classifier evaluation, and two supervised classification topics that have a deep impact on current bioinformatics. Finally, we point the interested reader to a set of popular web resources, open source software tools, and benchmarking data repositories that are frequently used by the machine learning community.
