Classification of Full Text Biomedical Documents: Sections Importance Assessment

Cited by: 3
Authors
Oliveira Goncalves, Carlos Adriano [1, 2, 3, 6]
Camacho, Rui [4]
Goncalves, Celia Talma [5]
Seara Vieira, Adrian [1, 2, 3]
Borrajo Diz, Lourdes [1, 2, 3]
Lorenzo Iglesias, Eva [1, 2, 3]
Affiliations
[1] Univ Vigo, Dept Comp Sci, Escuela Super Ingn Informat, Orense 32004, Spain
[2] Univ Vigo, CINBIO Biomed Res Ctr, Vigo 36310, Spain
[3] SERGAS UVIGO, Galicia Sur Hlth Res Inst IIS Galicia Sur, SING Res Grp, Vigo 36310, Spain
[4] Univ Porto, LIAAD INESC TEC, Fac Engn, P-4200465 Porto, Portugal
[5] LIACC, CEOS PP, ISCAP P PORTO, Campus FEUP, P-436900 Porto, Portugal
[6] Escuela Super Ingn Informat, Orense 32004, Spain
Source
APPLIED SCIENCES-BASEL | 2021 / Vol. 11 / Issue 06
Keywords
full text classification; preprocessing techniques; section weighing scheme; information retrieval
DOI
10.3390/app11062674
Chinese Library Classification (CLC) number
O6 [Chemistry]
Discipline classification code
0703
Abstract
The exponential growth of documents on the web makes it very hard for researchers to keep track of the relevant work being done within the scientific community. The task of efficiently retrieving information has therefore become an important research topic. The objective of this study is to test how the efficiency of text classification changes when different weights are assigned beforehand to the sections that compose the documents. The proposal takes into account the place (section) where terms are located in the document, and each section has a weight that can be modified depending on the corpus. To carry out the study, an extended version of the OHSUMED corpus with full documents has been created. Using WEKA, we compared the use of abstracts only with that of full texts, and we evaluated combinations of section weights to assess the significance of each section in the scientific article classification process. Classification was performed with SMO (Sequential Minimal Optimization), WEKA's implementation of the Support Vector Machine (SVM) algorithm. The experimental results show that the proposed combinations of preprocessing techniques and feature selection achieve promising results for the task of full text scientific document classification. We also have evidence to conclude that datasets enriched with text from certain sections achieve better results than datasets using only titles and abstracts.
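The section weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the section names, the weight values, and the helper weighted_term_frequencies are all hypothetical, and plain whitespace tokenization stands in for the preprocessing the paper evaluates. The sketch only shows how a term's count could be scaled by the weight of the section it appears in before the features reach a classifier.

```python
from collections import Counter

# Hypothetical section weights; the abstract does not state actual values.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.0, "methods": 1.0, "results": 1.5}


def weighted_term_frequencies(sections, weights, default_weight=1.0):
    """Sum term counts over all sections, scaling each count by its section weight."""
    totals = Counter()
    for name, text in sections.items():
        # Sections without an explicit weight fall back to default_weight.
        weight = weights.get(name, default_weight)
        for term, count in Counter(text.lower().split()).items():
            totals[term] += weight * count
    return dict(totals)


# Toy document split into sections (illustrative text only).
doc = {
    "title": "protein classification",
    "abstract": "protein folding study",
    "methods": "svm classification",
}
scores = weighted_term_frequencies(doc, SECTION_WEIGHTS)
print(scores)
```

Under these assumed weights, a term that occurs in the title contributes more to the document vector than the same term in the methods section. Changing SECTION_WEIGHTS per corpus mirrors the abstract's remark that each section's weight can be modified depending on the corpus.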
Pages: 13