Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles

Cited by: 9
Authors
Zheng, Wu [1]
Blake, Catherine [2]
Affiliations
[1] Univ Illinois, Grad Sch Lib & Informat Sci, Champaign, IL 61820 USA
[2] Univ Illinois, CIRSS, Grad Sch Lib & Informat Sci & Med Informat Sci, Champaign, IL 61820 USA
Funding
U.S. National Science Foundation
Keywords
BioNLP; Text mining; Relation extraction; Distant supervised learning; Protein subcellular localization extraction; GENE ONTOLOGY; PREDICTION; SEQUENCE; FEATURES;
DOI
10.1016/j.jbi.2015.07.013
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Databases of curated biomedical knowledge, such as the protein locations reflected in the UniProtKB database, provide an accurate and useful resource to researchers and decision makers. Our goal is to augment the manual efforts currently used to curate knowledge bases with automated approaches that leverage the increased availability of full-text scientific articles. This paper describes experiments that use distant supervised learning to identify protein subcellular localizations, which are important for understanding protein function and for identifying candidate drug targets. Experiments consider Swiss-Prot, the manually annotated subset of the UniProtKB protein knowledge base, and 43,000 full-text articles from the Journal of Biological Chemistry that contain just under 11.5 million sentences. The system achieves 0.81 precision and 0.49 recall at sentence level and an accuracy of 57% on held-out instances in a test set. Moreover, the approach identifies 8210 instances that are not in the UniProtKB knowledge base. Manual inspection of the 50 most likely relations showed that 41 (82%) were valid. These results have immediate benefit to researchers interested in protein function, and suggest that distant supervision should be explored to complement other manual data curation efforts. © 2015 Elsevier Inc. All rights reserved.
Pages: 134-144
Page count: 11