Cascade word embedding to sentence embedding: A class label enhanced approach to phenotype extraction

In molecular biology, phenotypes are often described using complex semantics and diverse biomedical expressions, which has motivated the development of named entity recognition (NER) methods. Here, we propose a novel approach for recognizing plant phenotypes by cascading word embedding to sentence embedding with class label enhancement. We first used a word embedding method to find high-frequency phenotypes, and then used the original sentences containing them as input to a sentence embedding method. Using this cascaded approach, we identified author-specific phenotypic expressions. In addition, we integrated a negative class label enhanced (NCLE) algorithm into our method to further optimize the training model of the sentence embedding method, Sen2Vec. We tested the effectiveness of our approach on 56,748 PubMed abstracts for the model organism Arabidopsis thaliana, which yielded a 135% increase in the number of new phenotypic descriptions compared with the original phenotype ontology.
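The cascade described above can be illustrated with a toy sketch. It is not the authors' implementation: co-occurrence counts stand in for the learned word embedding, bag-of-words cosine similarity stands in for Sen2Vec, and the NCLE step is omitted because the abstract does not specify it. All function names, the stopword list, the example sentences, and the threshold are assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and example sentences, for illustration only.
STOPWORDS = {"the", "and", "was", "under", "by", "a", "of", "in"}
SENTENCES = [
    "The mutant shows delayed flowering and reduced root length.",
    "Delayed flowering was observed under short days.",
    "The protein was purified by chromatography.",
    "Plants displayed reduced root length and pale leaves.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def word_stage(sentences, seed_terms, top_k=3):
    """Stage 1 stand-in: rank words by co-occurrence with known seed terms."""
    counts = Counter()
    for s in sentences:
        tokens = set(tokenize(s)) - STOPWORDS
        if tokens & seed_terms:
            counts.update(tokens - seed_terms)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def sentence_stage(sentences, candidates, seed_sentences, threshold=0.2):
    """Stage 2 stand-in: keep sentences that contain a stage-1 candidate
    and are similar to the centroid of the seed sentences."""
    centroid = Counter()
    for s in seed_sentences:
        centroid.update(t for t in tokenize(s) if t not in STOPWORDS)
    hits = []
    for s in sentences:
        vec = Counter(t for t in tokenize(s) if t not in STOPWORDS)
        if set(vec) & set(candidates) and cosine(vec, centroid) >= threshold:
            hits.append(s)
    return hits

def cascade(sentences, seed_terms, top_k=3, threshold=0.2):
    """Run the word stage, then feed the original sentences to the sentence stage."""
    candidates = word_stage(sentences, seed_terms, top_k)
    seed_sentences = [s for s in sentences if set(tokenize(s)) & seed_terms]
    hits = sentence_stage(sentences, candidates, seed_sentences, threshold)
    return candidates, hits

candidates, hits = cascade(SENTENCES, {"flowering"})
print(candidates)
for sentence in hits:
    print(sentence)
```

In this toy run the last example sentence never mentions the seed term, yet it is retrieved through a stage-1 candidate word. This loosely mirrors how a cascade might surface phenotypic descriptions that are absent from the starting vocabulary.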
