Classification of Full Text Biomedical Documents: Sections Importance Assessment

Cited by: 3
Authors
Oliveira Goncalves, Carlos Adriano [1 ,2 ,3 ,6 ]
Camacho, Rui [4 ]
Goncalves, Celia Talma [5 ]
Seara Vieira, Adrian [1 ,2 ,3 ]
Borrajo Diz, Lourdes [1 ,2 ,3 ]
Lorenzo Iglesias, Eva [1 ,2 ,3 ]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Escuela Super Ingn Informat, Orense 32004, Spain
[2] Univ Vigo, CINBIO Biomed Res Ctr, Vigo 36310, Spain
[3] SERGAS UVIGO, Galicia Sur Hlth Res Inst IIS Galicia Sur, SING Res Grp, Vigo 36310, Spain
[4] Univ Porto, LIAAD INESC TEC, Fac Engn, P-4200465 Porto, Portugal
[5] LIACC, CEOS PP, ISCAP P PORTO, Campus FEUP, P-436900 Porto, Portugal
[6] Escuela Super Ingn Informat, Orense 32004, Spain
Source
APPLIED SCIENCES-BASEL | 2021 / Volume 11 / Issue 06
Keywords
full text classification; preprocessing techniques; section weighing scheme; information retrieval
DOI
10.3390/app11062674
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The exponential growth of documents on the web makes it very hard for researchers to stay aware of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of text classification changes when different weights are assigned beforehand to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents has been created. Using WEKA, we compared the use of abstracts only with that of full texts, and evaluated section weighing combinations to assess their significance in the scientific article classification process. The classifier used was SMO (Sequential Minimal Optimization), the WEKA implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of preprocessing techniques and feature selection achieve promising results for the task of full text scientific document classification. We also have evidence to conclude that datasets enriched with text from certain sections achieve better results than those using only titles and abstracts.
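The idea of giving each section its own weight can be illustrated with a minimal Python sketch. It is not the authors' WEKA pipeline: the section names, the weight values, the whitespace tokenizer, and the function name `weighted_term_frequencies` are all assumptions made for illustration, since the abstract only states that each section has a weight that can be modified depending on the corpus.

```python
from collections import Counter

# Hypothetical per-section weights (assumed values, not taken from the paper).
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "methods": 1.0, "results": 1.5}


def weighted_term_frequencies(sections, weights, default_weight=1.0):
    """Sum term occurrences, scaling each occurrence by its section's weight.

    sections: dict mapping section name -> section text
    weights:  dict mapping section name -> numeric weight
    """
    totals = Counter()
    for name, text in sections.items():
        weight = weights.get(name, default_weight)
        # Naive lowercase/whitespace tokenization (an assumption of this sketch).
        for term in text.lower().split():
            totals[term] += weight
    return dict(totals)


# Toy document split into sections (made-up text for illustration only).
doc = {
    "title": "gene expression",
    "abstract": "gene expression in tumor cells",
    "results": "tumor growth",
}
features = weighted_term_frequencies(doc, SECTION_WEIGHTS)
```

In this sketch a term that occurs in the title contributes more to the feature vector than the same term in the results section. The resulting weighted frequencies could then be passed to any classifier, such as an SVM.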
Pages: 13