Substring selection for biomedical document classification

Cited by: 15
Authors
Han, Bo
Obradovic, Zoran
Hu, Zhang-Zhi
Wu, Cathy H.
Vucetic, Slobodan [1]
Affiliations
[1] Temple Univ, Ctr Informat Sci & Technol, Philadelphia, PA 19122 USA
[2] Georgetown Univ, Med Ctr, Dept Biochem & Mol & Cellular Biol, Washington, DC 20007 USA
Funding
National Institutes of Health (USA)
DOI
10.1093/bioinformatics/btl350
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Attribute selection is a critical step in the development of document classification systems. As a standard practice, words are stemmed and the most informative ones are used as attributes in classification. Owing to the high complexity of biomedical terminology, general-purpose stemming algorithms are often conservative and could also remove informative stems. This can lead to accuracy reduction, especially when the number of labeled documents is small. To address this issue, we propose an algorithm that omits stemming and, instead, uses the most discriminative substrings as attributes.
Results: The approach was tested on five annotated sets of abstracts from iProLINK that report on the experimental evidence about five types of protein post-translational modifications. The experiments showed that Naive Bayes and support vector machine classifiers perform consistently better [with area under the ROC curve (AUC) in the range 0.92-0.97] when using the proposed attribute selection than when using attributes obtained by the Porter stemmer algorithm (AUC in the range 0.86-0.93). The proposed approach is particularly useful when labeled datasets are small.
Contact: vucetic@ist.temple.edu
Supplementary Information: The supplementary data are available from www.ist.temple.edu/PIRsupplement.
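Illustrative sketch (not from the paper): the abstract does not say how substrings are enumerated or how "most discriminative" is measured, so the Python below makes its own assumptions. It takes all within-word substrings of length 3 to 6 as candidates and ranks them by information gain with respect to a binary label. The function names, the length bounds, and the choice of information gain are all hypothetical and only meant to make the general idea concrete.

```python
import math
from collections import Counter


def substrings(word, min_len=3, max_len=6):
    """Distinct substrings of `word` with length in [min_len, max_len] (assumed bounds)."""
    return {word[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(word) - n + 1)}


def doc_substrings(text, min_len=3, max_len=6):
    """Set of candidate substrings occurring in any whitespace-separated word of a document."""
    found = set()
    for word in text.lower().split():
        found |= substrings(word, min_len, max_len)
    return found


def entropy(pos, neg):
    """Binary entropy (in bits) of a group with `pos` positive and `neg` negative documents."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(n_pos, n_neg, with_pos, with_neg):
    """Information gain of splitting documents by presence of one substring.

    This scoring criterion is an assumption; the abstract does not name one.
    """
    n = n_pos + n_neg
    n_with = with_pos + with_neg
    n_without = n - n_with
    conditional = ((n_with / n) * entropy(with_pos, with_neg)
                   + (n_without / n) * entropy(n_pos - with_pos, n_neg - with_neg))
    return entropy(n_pos, n_neg) - conditional


def select_substrings(docs, labels, k=5):
    """Return the k substrings with the highest information gain (ties broken alphabetically)."""
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(docs, labels):
        # Count document frequency, not raw occurrences.
        (pos_counts if label else neg_counts).update(doc_substrings(text))
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    scores = {s: information_gain(n_pos, n_neg, pos_counts[s], neg_counts[s])
              for s in set(pos_counts) | set(neg_counts)}
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


# Toy, made-up documents: label 1 = reports a modification, 0 = does not.
docs = ["protein phosphorylation observed", "phosphorylated serine residue",
        "gene expression measured", "cell growth rates"]
labels = [1, 1, 0, 0]
selected = select_substrings(docs, labels, k=40)
```

In this toy setting a substring such as "phos" is shared by "phosphorylation" and "phosphorylated" without any stemming step, which is the kind of attribute the abstract argues for. The selected substrings would then serve as presence/absence attributes for a classifier such as Naive Bayes or a support vector machine.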
Pages: 2136-2142
Page count: 7
Related papers
50 in total
  • [1] Feature selection for document type classification
    Taghva, Kazem
    Vergara, Jason
    [J]. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: NEW GENERATIONS, 2008, : 179 - 182
  • [2] Utilizing image and caption information for biomedical document classification
    Li, Pengyuan
    Jiang, Xiangying
    Zhang, Gongbo
    Trabucco, Juan Trelles
    Raciti, Daniela
    Smith, Cynthia
    Ringwald, Martin
    Marai, G. Elisabeta
    Arighi, Cecilia
    Shatkay, Hagit
    [J]. BIOINFORMATICS, 2021, 37 : I468 - I476
  • [3] Evaluating the effect of unbalanced data in biomedical document classification
    Laza, Rosalia
    Pavon, Reyes
    Reboiro-Jato, Miguel
    Fdez-Riverola, Florentino
    [J]. JOURNAL OF INTEGRATIVE BIOINFORMATICS, 2011, 8 (03)
  • [4] Logic classification and feature selection for biomedical data
    Bertolazzi, P.
    Felici, G.
    Festa, P.
    Lancia, G.
    [J]. COMPUTERS & MATHEMATICS WITH APPLICATIONS, 2008, 55 (05) : 889 - 899
  • [5] Feature selection for document classification based on topology
    El Barbary, O. G.
    Salama, A. S.
    [J]. EGYPTIAN INFORMATICS JOURNAL, 2018, 19 (02) : 129 - 132
  • [6] The impact of feature selection on medical document classification
    Parlak, Bekir
    Uysal, Alper Kursat
    [J]. 2016 11TH IBERIAN CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES (CISTI), 2016,
  • [7] Feature selection for the classification of large document collections
    Brank, Janez
    Mladenic, Dunja
    Grobelnik, Marko
    Milic-Frayling, Natasa
    [J]. JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2008, 14 (10) : 1562 - 1596
  • [8] Discriminative Feature Analysis and Selection for Document Classification
    Chinta, Punya Murthy
    Murty, M. Narasimha
    [J]. NEURAL INFORMATION PROCESSING, ICONIP 2012, PT I, 2012, 7663 : 366 - 374
  • [9] Parameterized Intractability of Distinguishing Substring Selection
    Gramm, Jens
    Guo, Jiong
    Niedermeier, Rolf
    [J]. THEORY OF COMPUTING SYSTEMS, 2006, 39 : 545 - 560
  • [10] UTILIZING IMAGE-BASED FEATURES IN BIOMEDICAL DOCUMENT CLASSIFICATION
    Ma, Kaidi
    Jeong, Hogyeong
    Rohith, M. V.
    Somanath, Gowri
    Tarpine, Ryan
    Schutter, Kyle
    Blostein, Dorothea
    Istrail, Sorin
    Kambhamettu, Chandra
    Shatkay, Hagit
    [J]. 2015 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2015, : 4451 - 4455