Paper information - Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks

Background: We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (interaction article subtask [IAS]), discovery of protein pairs (interaction pair subtask [IPS]), and identification of text passages characterizing protein interaction (interaction sentences subtask [ISS]) in full-text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam detection techniques, as well as an uncertainty-based integration scheme. We also used a support vector machine and singular value decomposition on the same features for comparison purposes. Our approach to the full-text subtasks (protein pair and passage identification) includes a feature expansion method based on word proximity networks.

Results: Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of measures of performance used in the challenge evaluation (accuracy, F-score, and area under the receiver operating characteristic curve). We also report on a web tool that we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full-text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages.

Conclusion: Our approach to abstract classification shows that a simple linear model, using relatively few features, can generalize and uncover the conceptual nature of protein-protein interactions from the bibliome. Because the novel approach is based on a rather lightweight linear model, it can easily be ported and applied to similar problems. In full-text problems, the expansion of word features with word proximity networks is shown to be useful, although the need for some improvements is discussed.
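The abstract names feature expansion with word proximity networks but does not define the proximity measure. As a rough illustration only, the sketch below assumes a Jaccard-style co-occurrence weight (the number of passages containing both words divided by the number containing either) and a fixed expansion threshold. The function names `proximity_network` and `expand_features`, the threshold value, and the toy passages are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def proximity_network(passages):
    """Build a word proximity network from tokenized passages.

    Assumption (not from the paper): the proximity of two words is the
    Jaccard overlap of the sets of passages in which each word occurs.
    """
    occurrences = defaultdict(set)
    for index, tokens in enumerate(passages):
        for token in set(tokens):
            occurrences[token].add(index)

    network = defaultdict(dict)
    for a, b in combinations(sorted(occurrences), 2):
        shared = len(occurrences[a] & occurrences[b])
        if shared == 0:
            continue  # words that never co-occur get no edge
        weight = shared / len(occurrences[a] | occurrences[b])
        network[a][b] = weight
        network[b][a] = weight
    return network


def expand_features(seed_words, network, threshold=0.5):
    """Add every word whose proximity to some seed word meets the threshold."""
    expanded = set(seed_words)
    for seed in seed_words:
        for neighbor, weight in network.get(seed, {}).items():
            if weight >= threshold:
                expanded.add(neighbor)
    return expanded


# Made-up toy passages; "p1" and "p2" stand in for protein mentions.
passages = [
    ["p1", "binds", "p2"],
    ["p1", "interacts", "p2"],
    ["p2", "cleaves", "dna"],
]
network = proximity_network(passages)
expanded = expand_features({"binds"}, network, threshold=0.5)
```

With these toy passages, "binds" shares one of two passages with "p1" (weight 0.5) and one of three with "p2" (weight 1/3), so only "p1" joins the expanded feature set at the assumed threshold of 0.5.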
