Automatic document classification of biological literature

Cited by: 31
Authors
Chen, David
Muller, Hans-Michael [1]
Sternberg, Paul W.
Affiliations
[1] CALTECH, Div Biol, Pasadena, CA 91125 USA
[2] CALTECH, Howard Hughes Med Inst, Pasadena, CA 91125 USA
DOI
10.1186/1471-2105-7-370
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.

Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a classifier trained with a support vector machine, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters 21578), the clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49) while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Conclusion: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
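
The abstract names a phrase-based clustering step with human-readable labels and reports an F-value, but gives no details of either. The following is only a minimal sketch of the general idea, not the authors' algorithm: it assumes that a "phrase" is a word bigram, that a cluster is the set of documents sharing a bigram, and that the F-value is the usual balanced F-measure. The function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def bigrams(text):
    """Return the set of lowercase word bigrams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + 2]) for i in range(len(words) - 1)}


def phrase_clusters(docs, min_docs=2):
    """Group documents that share a bigram; the bigram doubles as the label.

    Toy stand-in for phrase-based clustering (an assumption, not the
    method described in the paper).
    """
    by_phrase = defaultdict(set)
    for doc_id, text in docs.items():
        for phrase in bigrams(text):
            by_phrase[phrase].add(doc_id)
    # Keep only phrases shared by at least `min_docs` documents.
    return {p: ids for p, ids in by_phrase.items() if len(ids) >= min_docs}


def f_value(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical sample documents, invented for illustration only.
docs = {
    "d1": "vulval induction requires ras signaling",
    "d2": "ras signaling controls vulval induction",
    "d3": "dauer formation depends on insulin signaling",
}
clusters = phrase_clusters(docs)
for label, members in sorted(clusters.items()):
    print(label, sorted(members))
```

In this toy setup the cluster label is simply the shared phrase, which is one way a clustering step could produce descriptions that a reader understands without further processing.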
Pages: 11
Source: BMC Bioinformatics, 7