Classification of Full Text Biomedical Documents: Sections Importance Assessment

Cited by: 3
Authors
Oliveira Goncalves, Carlos Adriano [1 ,2 ,3 ,6 ]
Camacho, Rui [4 ]
Goncalves, Celia Talma [5 ]
Seara Vieira, Adrian [1 ,2 ,3 ]
Borrajo Diz, Lourdes [1 ,2 ,3 ]
Lorenzo Iglesias, Eva [1 ,2 ,3 ]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Escuela Super Ingn Informat, Orense 32004, Spain
[2] Univ Vigo, CINBIO Biomed Res Ctr, Vigo 36310, Spain
[3] SERGAS UVIGO, Galicia Sur Hlth Res Inst IIS Galicia Sur, SING Res Grp, Vigo 36310, Spain
[4] Univ Porto, LIAAD INESC TEC, Fac Engn, P-4200465 Porto, Portugal
[5] LIACC, CEOS PP, ISCAP P PORTO, Campus FEUP, P-436900 Porto, Portugal
[6] Escuela Super Ingn Informat, Orense 32004, Spain
Source
APPLIED SCIENCES-BASEL | 2021 / Volume 11 / Issue 06
Keywords
full text classification; preprocessing techniques; section weighing scheme; information retrieval;
DOI
10.3390/app11062674
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703 ;
Abstract
The exponential growth of documents on the web makes it very hard for researchers to keep track of the relevant work being done within the scientific community. Efficient information retrieval has therefore become an important research topic. The objective of this study is to test how the efficiency of text classification changes when different weights are assigned in advance to the sections that make up the documents. The proposal takes into account the section in which each term occurs, and each section has a weight that can be adjusted depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus containing full documents was created. Using WEKA, we compared classification on abstracts only with classification on full texts, and we evaluated combinations of section weights to assess the significance of each section in the scientific article classification process. The classifier was SMO (Sequential Minimal Optimization), the WEKA implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of preprocessing techniques and feature selection achieve promising results for full-text scientific document classification. We also have evidence to conclude that datasets enriched with text from certain sections achieve better results than datasets using only titles and abstracts.
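The section weighting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used WEKA), and the abstract does not report the actual weights, section names, or preprocessing steps. The section names, the weight values, the `tokenize` helper, and the `weighted_term_frequencies` function below are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical section weights; the abstract does not give the real values.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "methods": 1.0, "results": 1.5}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def weighted_term_frequencies(sections, weights, default_weight=1.0):
    """Sum term counts over all sections, scaling each count by its section weight."""
    totals = Counter()
    for name, text in sections.items():
        weight = weights.get(name, default_weight)
        for term, count in Counter(tokenize(text)).items():
            totals[term] += weight * count
    return dict(totals)


# Toy document split into sections (invented example text).
doc = {
    "title": "Gene expression",
    "abstract": "gene expression in tumor cells",
    "results": "tumor growth",
}
scores = weighted_term_frequencies(doc, SECTION_WEIGHTS)
print(scores)
```

With these assumed weights, a term that appears in the title contributes more to the document vector than the same term in the results section. The resulting weighted frequencies could then be passed to any classifier, for example an SVM, in place of plain term counts.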
Pages: 13