Automatic document classification of biological literature

Cited by: 31
Authors
Chen, David
Muller, Hans-Michael [1]
Sternberg, Paul W.
Affiliations
[1] CALTECH, Div Biol, Pasadena, CA 91125 USA
[2] CALTECH, Howard Hughes Med Inst, Pasadena, CA 91125 USA
DOI
10.1186/1471-2105-7-370
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.
Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Conclusion: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
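The abstract compares clustering quality using an F-value (0.55 vs. 0.49) but does not say how it is computed. As a rough illustration only, the sketch below assumes the common definition: the harmonic mean of precision and recall, measured between one cluster and one reference category. The function names and the toy document IDs are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
def precision_recall(predicted: set, relevant: set) -> tuple:
    """Precision and recall of a predicted document set against a reference set."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f_value(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denominator


# Hypothetical toy data: one cluster produced by an algorithm,
# and the reference category it is compared against.
cluster = {"doc1", "doc2", "doc3", "doc4"}
category = {"doc2", "doc3", "doc4", "doc5", "doc6"}

p, r = precision_recall(cluster, category)
print(f"precision={p:.2f} recall={r:.2f} F={f_value(p, r):.3f}")
# prints "precision=0.75 recall=0.60 F=0.667"
```

Under this definition, a score over a whole clustering is usually obtained by aggregating such per-category values; which aggregation the authors used is not stated in the abstract.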
Pages: 11