A gene–phenotype relationship extraction pipeline from the biomedical literature using a representation learning approach

Motivation: A fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, many of which are reported across a large body of publications. Because the lexical features of gene names are relatively regular in text, the main difficulty in extracting these relations is phenotype recognition. Phenotypic descriptions are often study- or author-specific, so few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.

Results: We propose a pipeline for extracting phenotypes, genes and their relations from the biomedical literature. By combining abbreviation revision and sentence template extraction, we improved an unsupervised word-embedding-to-sentence-embedding cascaded approach, used as representation learning, to recognize the broad variety of phenotypic information in the literature. A dictionary- and rule-based method was applied for gene recognition. Finally, we integrated the well-known open information extraction system OLLIE to identify gene-phenotype relations. To demonstrate the applicability of the pipeline, we conducted two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). To assess validity, we also applied the pipeline to 481 full-text articles from the manually curated TAIR gene-phenotype relationship dataset. The results showed that the proposed pipeline covers 70.94% of the original dataset and adds 373 new relations to expand it.

Availability and implementation: The source code is available at http://www.wutbiolab.cn:82/Gene‐Phenotype‐Relation‐Extraction‐Pipeline.zip.
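As a rough illustration of what a dictionary- and rule-based gene recognition step can look like, the following Python sketch combines a small symbol dictionary with a regular expression for Arabidopsis AGI locus codes (for example, AT5G10140). It is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the names `GENE_DICT`, `AGI_PATTERN` and `find_genes` are hypothetical, and the two-entry dictionary stands in for a curated resource such as TAIR.

```python
import re

# Hypothetical toy dictionary of gene symbols -> AGI locus codes.
# Assumption: a real system would load a full curated lexicon instead.
GENE_DICT = {"FLC": "AT5G10140", "FT": "AT1G65480"}

# Rule: AGI locus codes look like AT<chromosome>G<5 digits>, where the
# chromosome is 1-5, C (chloroplast) or M (mitochondrion). Case varies in text.
AGI_PATTERN = re.compile(r"\bAT[1-5CM]G\d{5}\b", re.IGNORECASE)


def find_genes(sentence):
    """Return sorted (start, end, surface_form, locus) tuples for gene mentions."""
    hits = []
    # Rule-based pass: match locus codes directly and normalize their case.
    for m in AGI_PATTERN.finditer(sentence):
        hits.append((m.start(), m.end(), m.group(), m.group().upper()))
    # Dictionary pass: match known symbols as whole, case-sensitive words.
    for symbol, locus in GENE_DICT.items():
        for m in re.finditer(r"\b" + re.escape(symbol) + r"\b", sentence):
            hits.append((m.start(), m.end(), m.group(), locus))
    return sorted(hits)


example = "Loss of FLC (At5g10140) causes early flowering."
mentions = find_genes(example)
for start, end, surface, locus in mentions:
    print(f"{surface!r} at {start}-{end} -> {locus}")
```

Matching symbols case-sensitively and on word boundaries is one simple way to reduce false positives from short gene symbols; the regular rule for locus codes reflects the observation above that gene names are lexically more regular than phenotype descriptions.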
