Classification of Full Text Biomedical Documents: Sections Importance Assessment

Cited by: 3
Authors
Oliveira Goncalves, Carlos Adriano [1 ,2 ,3 ,6 ]
Camacho, Rui [4 ]
Goncalves, Celia Talma [5 ]
Seara Vieira, Adrian [1 ,2 ,3 ]
Borrajo Diz, Lourdes [1 ,2 ,3 ]
Lorenzo Iglesias, Eva [1 ,2 ,3 ]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Escuela Super Ingn Informat, Orense 32004, Spain
[2] Univ Vigo, CINBIO Biomed Res Ctr, Vigo 36310, Spain
[3] SERGAS UVIGO, Galicia Sur Hlth Res Inst IIS Galicia Sur, SING Res Grp, Vigo 36310, Spain
[4] Univ Porto, LIAAD INESC TEC, Fac Engn, P-4200465 Porto, Portugal
[5] LIACC, CEOS PP, ISCAP P PORTO, Campus FEUP, P-436900 Porto, Portugal
[6] Escuela Super Ingn Informat, Orense 32004, Spain
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 06
Keywords
full text classification; preprocessing techniques; section weighing scheme; information retrieval;
DOI
10.3390/app11062674
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703 ;
Abstract
The exponential growth of documents on the web makes it very hard for researchers to keep track of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of text classification changes when different weights are assigned beforehand to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents was created. Using WEKA, we compared the use of abstracts only with that of full texts, and evaluated combinations of section weights to assess the significance of each section in the scientific article classification process. Classification used SMO (Sequential Minimal Optimization), the WEKA implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of preprocessing techniques and feature selection achieve promising results for the task of full-text scientific document classification. We also have evidence to conclude that datasets enriched with text from certain sections achieve better results than those using only titles and abstracts.
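The section-weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the weighting formula, so the sketch assumes a weighted sum of per-section term counts. The function name `weighted_term_frequencies`, the section names, and the weight values are all hypothetical.

```python
from collections import Counter


def weighted_term_frequencies(sections, weights, default_weight=1.0):
    """Sum per-section term counts, scaling each section by its weight.

    Assumption: a term's score is sum(weight[section] * count_in_section).
    The abstract only says each section has a corpus-dependent weight.
    """
    totals = Counter()
    for name, text in sections.items():
        w = weights.get(name, default_weight)
        # Naive whitespace tokenization; real preprocessing would differ.
        for term, count in Counter(text.lower().split()).items():
            totals[term] += w * count
    return dict(totals)


# Hypothetical document split into sections, with illustrative weights.
doc = {
    "title": "gene expression",
    "abstract": "gene therapy trial",
    "methods": "trial design",
}
weights = {"title": 3.0, "abstract": 2.0, "methods": 1.0}
scores = weighted_term_frequencies(doc, weights)
```

Here a term in the title counts three times as much as the same term in the methods section. The resulting scores could serve as feature values for a classifier such as an SVM.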
Pages: 13
Related papers
50 in total
  • [21] The Research of Text Preprocessing Effect on Text Documents Classification Efficiency
    Kurbatow, Andrew
    2015 INTERNATIONAL CONFERENCE "STABILITY AND CONTROL PROCESSES" IN MEMORY OF V.I. ZUBOV (SCP), 2015, : 653 - 655
  • [22] Hierarchical Method for Automated Text Documents Classification
    Mousa, Mohamed H.
    Khedr, Ayman E.
    Idrees, Amira M.
    INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2025, 22 (01) : 11 - 19
  • [23] Distributed boosting algorithm for classification of text documents
    Sarnovsky, Martin
    Vronc, Michal
    2014 IEEE 12TH INTERNATIONAL SYMPOSIUM ON APPLIED MACHINE INTELLIGENCE AND INFORMATICS (SAMI), 2014, : 216 - 219
  • [24] Text classification without labeled negative documents
    Fung, GPC
    Yu, JX
    Lu, HJ
    Yu, PS
    ICDE 2005: 21ST INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2005, : 594 - 605
  • [25] Knowledge Supervised Text Classification with No Labeled Documents
    Zhang, Congle
    Xue, Gui-Rong
    Yu, Yong
    PRICAI 2008: TRENDS IN ARTIFICIAL INTELLIGENCE, 2008, 5351 : 509 - +
  • [26] Towards topic driven access to full text documents
    Caracciolo, C
    van Hage, W
    de Rijke, M
    RESEARCH AND ADVANCED TECHNOLOGY FOR DIGITAL LIBRARIES, 2004, 3232 : 495 - 500
  • [27] PsyDok: Electronic Full text Archive for Psychological Documents
    Herb, Ulrich
    SEVENTH INTERNATIONAL CONFERENCE ON GREY LITERATURE, GL7 CONFERENCE PROCEEDINGS, 2006, (07) : 81 - 86
  • [28] Open access to scholarly full-text documents
    Jacso, Peter
    ONLINE INFORMATION REVIEW, 2006, 30 (05) : 587 - 594
  • [29] Generation of Synthetic Images of Full-Text Documents
    Bures, Lukas
    Neduchal, Petr
    Hlavac, Miroslav
    Hruz, Marek
    SPEECH AND COMPUTER (SPECOM 2018), 2018, 11096 : 68 - 75
  • [30] SF-CNN: Deep Text Classification and Retrieval for Text Documents
    Sarasu, R.
    Thyagharajan, K. K.
    Shanker, N. R.
    INTELLIGENT AUTOMATION AND SOFT COMPUTING, 2023, 35 (02) : 1799 - 1813