Bag of Words and Embedding Text Representation Methods for Medical Article Classification

Cited by: 1
Author
Cichosz, Pawel [1]
Affiliation
[1] Warsaw Univ Technol, Inst Comp Sci, Nowowiejska 15-19, Warsaw, Poland
Keywords
text representation; text classification; bag of words; word embeddings; WORKLOAD;
DOI
10.34768/amcs-2023-0043
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract
Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly from not necessarily large and usually imbalanced data with low computational demands, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefit of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
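The contrast the abstract draws between bag of words and fastText-style embeddings (vocabulary-sized versus fixed low dimensionality, and handling of out-of-vocabulary words) can be illustrated with a minimal sketch. This is not the paper's code: the helper names, the toy documents, and the embedding size `DIM` are assumptions made for illustration, and the character n-gram vectors are derived from a hash as a stand-in for vectors that a real fastText model would learn from data.

```python
import hashlib
from collections import Counter

DIM = 4  # toy embedding size (assumption; real models use far more dimensions)


def tokenize(text):
    """Lowercase whitespace tokenization; real pipelines do more cleaning."""
    return text.lower().split()


def build_vocabulary(docs):
    """Sorted list of all distinct tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bag_of_words(doc, vocabulary):
    """Term-count vector with one dimension per vocabulary term.
    Tokens outside the vocabulary are silently dropped."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_vector(ngram, dim=DIM):
    """Toy stand-in for a learned n-gram embedding: a deterministic
    vector derived from a hash (assumption, NOT a trained vector)."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    return [byte / 255.0 - 0.5 for byte in digest[:dim]]


def mean_vector(vectors, dim=DIM):
    """Component-wise mean; a zero vector if there is nothing to average."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def embed_word(word, dim=DIM):
    """Word vector as the mean of its character n-gram vectors, so any
    word, seen in training or not, receives a vector."""
    return mean_vector([ngram_vector(g, dim) for g in char_ngrams(word)], dim)


def embed_document(doc, dim=DIM):
    """Document vector as the mean of its word vectors."""
    return mean_vector([embed_word(tok, dim) for tok in tokenize(doc)], dim)


train_docs = [
    "aspirin reduces stroke risk",
    "statin therapy lowers cholesterol",
]
vocabulary = build_vocabulary(train_docs)

new_doc = "aspirin lowers hypertension"  # "hypertension" is out of vocabulary
bow = bag_of_words(new_doc, vocabulary)
emb = embed_document(new_doc)

print("bag of words dimensions:", len(bow), "tokens counted:", sum(bow))
print("embedding dimensions:", len(emb))
```

In this sketch the bag of words vector grows with the vocabulary and ignores the unseen word, while the averaged embedding keeps a fixed size and still draws on the unseen word through its character n-grams. Either vector could then be passed to a conventional classifier such as logistic regression.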
Pages: 603-621
Number of pages: 19