MeSHup: A Corpus for Full Text Biomedical Document Indexing

Authors
Wang, Xindi [1, 3]
Mercer, Robert E. [1, 3]
Rudzicz, Frank [2, 3, 4]
Affiliations
[1] Univ Western Ontario, Dept Comp Sci, London, ON, Canada
[2] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada
[3] Vector Inst, Toronto, ON, Canada
[4] Unity Hlth Toronto, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
MeSH Indexing; Multi-label text classification
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Medical Subject Headings (MeSH) indexing is the task of assigning to a given biomedical document the most relevant labels from an extremely large set of MeSH terms. Currently, the vast number of biomedical articles in the PubMed database are annotated manually by human curators, which is time-consuming and costly; a computational system that can assist the indexing is therefore highly valuable. Developing supervised MeSH indexing systems calls for a large-scale annotated text corpus, and a publicly available one that permits robust evaluation and comparison of systems is important to the research community. We release MeSHup, a large-scale annotated MeSH indexing corpus containing 1,342,667 full-text articles in English, together with the associated MeSH labels and metadata, authors, and publication venues, collected from the MEDLINE database. We train an end-to-end model on our corpus that combines features from documents and their associated labels, and report the results as a new baseline.
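The multi-label framing described in the abstract can be illustrated with a minimal sketch: each document carries a set of MeSH terms, which may be encoded as a 0/1 vector over the label vocabulary, and predictions are compared to the curated labels set by set. The four-term vocabulary, the toy documents, and the choice of micro-averaged F1 as the score are assumptions made for illustration; they are not taken from the MeSHup paper or corpus.

```python
from typing import List, Set


def multi_hot(labels: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode a set of MeSH terms as a 0/1 vector over the label vocabulary."""
    return [1 if term in labels else 0 for term in vocabulary]


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1 pooled over all (document, label) decisions."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))  # correct labels
    fp = sum(len(p - g) for g, p in zip(gold, predicted))  # spurious labels
    fn = sum(len(g - p) for g, p in zip(gold, predicted))  # missed labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: the real MeSH vocabulary has tens of thousands of terms.
vocabulary = ["Humans", "Animals", "Neoplasms", "Mice"]
gold = [{"Humans", "Neoplasms"}, {"Animals", "Mice"}]
predicted = [{"Humans"}, {"Animals", "Mice", "Neoplasms"}]

first_doc_vector = multi_hot(gold[0], vocabulary)
score = micro_f1(gold, predicted)
```

In this toy case one gold label is missed and one spurious label is predicted, so precision and recall are both 3/4. A real indexing system would replace the hand-written `predicted` sets with model output.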
Pages: 5473-5483
Page count: 11