Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Background:We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.Results:Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of measures of performance used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages.Conclusion:Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a rather lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, the expansion of word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.

[1]  Susan T. Dumais Enhancing performance in latent semantic indexing , 1990 .

[2]  Burr Settles,et al.  ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text , 2005 .

[3]  Y. Zhang,et al.  IntAct—open source resource for molecular interaction data , 2006, Nucleic Acids Res..

[4]  Juan M. Corchado,et al.  SpamHunting: An instance-based reasoning system for spam labelling and filtering , 2007, Decis. Support Syst..

[5]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[6]  Gabriele Ausiello,et al.  MINT: the Molecular INTeraction database , 2006, Nucleic Acids Res..

[7]  T. Shibata,et al.  Stable Association of 70-kDa Heat Shock Protein Induces Latent Multisite Specificity of a Unisite-specific Endonuclease in Yeast Mitochondria* , 1999, The Journal of Biological Chemistry.

[8]  Hagit Shatkay,et al.  Mining the Biomedical Literature in the Genomic Era: An Overview , 2003, J. Comput. Biol..

[9]  Werner Dubitzky,et al.  A Practical Approach to Microarray Data Analysis , 2003, Springer US.

[10]  P. Bork,et al.  Literature mining for the biologist: from information retrieval to biological discovery , 2006, Nature Reviews Genetics.

[11]  Luis Mateus Rocha,et al.  Use of Text Mining for Protein Structure Prediction and Functional Annotation in Lack of Sequence Homology , 2006 .

[12]  Ana Gabriela Maguitman,et al.  Uncovering Protein-Protein Interactions in the Bibliome , 2007 .

[13]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[14]  Luis Mateus Rocha,et al.  Singular value decomposition and principal component analysis , 2003 .

[15]  Dmitrij Frishman,et al.  MIPS: analysis and annotation of proteins from whole genomes in 2005 , 2005, Nucleic Acids Res..

[16]  Karin M. Verspoor,et al.  Large-Scale Testing of Bibliome Informatics Using Pfam Protein Families , 2005, Pacific Symposium on Biocomputing.

[17]  Alfonso Valencia,et al.  Overview of BioCreAtIvE: critical assessment of information extraction for biology , 2005, BMC Bioinformatics.

[18]  Karin M. Verspoor,et al.  Protein annotation as term categorization in the gene ontology using word proximity networks , 2005, BMC Bioinformatics.