On protocols and measures for the validation of supervised methods for the inference of biological networks

Networks provide a natural representation of molecular biology knowledge, in particular for modeling relationships between biological entities such as genes, proteins, drugs, or diseases. Because the experiments needed to elucidate these networks are laborious, costly, or unavailable, computational approaches to network inference have been investigated frequently in the literature. In this paper, we examine the assessment of supervised network inference. Supervised inference relies on machine learning techniques that infer the network from a training sample of known interacting (and possibly non-interacting) entities together with additional measurement data. While these methods are very effective, validating them reliably in silico is challenging, since both prediction and validation must be performed on the same partially known network. Cross-validation techniques therefore need to be specifically adapted to classification problems on pairs of objects. We critically review and assess the protocols and measures proposed in the literature and derive specific guidelines on how best to exploit and evaluate machine learning techniques for network inference. Through theoretical considerations and in silico experiments, we analyze in depth how important factors influence the outcome of performance estimation. These factors include the amount of information available for the interacting entities, the sparsity and topology of biological networks, and the lack of experimentally verified non-interacting pairs.
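
To illustrate why cross-validation on pairs differs from ordinary cross-validation, the following minimal sketch holds out nodes rather than individual pairs and then sorts every candidate pair by how many of its endpoints are held out. It assumes a homogeneous, undirected network; the node labels, function names, and fold count are hypothetical choices made for illustration and are not taken from the paper.

```python
import random
from itertools import combinations


def node_based_folds(nodes, k, seed=0):
    """Shuffle the nodes and deal them into k folds (illustrative helper)."""
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def split_pairs(nodes, test_nodes):
    """Group every unordered pair by the number of held-out endpoints."""
    test = set(test_nodes)
    names = ("train_train", "train_test", "test_test")
    groups = {name: [] for name in names}
    for u, v in combinations(nodes, 2):
        held_out = (u in test) + (v in test)  # 0, 1, or 2 endpoints held out
        groups[names[held_out]].append((u, v))
    return groups


# Hypothetical toy network with six nodes and three folds.
nodes = [f"g{i}" for i in range(6)]
folds = node_based_folds(nodes, k=3)
groups = split_pairs(nodes, folds[0])

for name, pairs in groups.items():
    print(name, len(pairs))
```

Pairs in which one or both endpoints are held out have less information available about their entities than pairs between training nodes, so it may be informative to report performance separately for each group rather than pooling them.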
