Bag of Words and Embedding Text Representation Methods for Medical Article Classification

Cited by: 1
Authors
Cichosz, Pawel [1 ]
Affiliations
[1] Warsaw Univ Technol, Inst Comp Sci, Nowowiejska 15-19, Warsaw, Poland
Keywords
text representation; text classification; bag of words; word embeddings; workload
DOI
10.34768/amcs-2023-0043
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly from not necessarily large and usually imbalanced data with low computational demands, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefit of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
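The pipeline the abstract describes — vectorizing abstracts with a bag-of-words representation and feeding them to a conventional classifier — can be sketched as follows. This is an illustrative sketch, not the authors' code: the toy abstracts, labels, and the choice of TF-IDF weighting with logistic regression (one of the four algorithms compared) are assumptions for demonstration.

```python
# Sketch of an SLR relevance classifier: bag-of-words (TF-IDF) features
# plus logistic regression, one of the representation/algorithm pairs
# compared in the paper. Data below is invented toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for article abstracts; 1 = relevant to the review topic.
abstracts = [
    "randomized trial of statin therapy for cardiovascular risk",
    "cohort study of statin use and myocardial infarction outcomes",
    "survey of hospital staffing levels and administrative workload",
    "qualitative study of patient satisfaction with telehealth visits",
]
labels = [1, 1, 0, 0]

# Bag of words: each abstract becomes a sparse vector of term weights,
# which a conventional tabular-data learner can consume directly.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(abstracts, labels)

prediction = model.predict(["statin trial outcomes"])[0]
print(prediction)  # terms seen only in relevant abstracts push toward 1
```

An embedding-based variant would replace the TF-IDF step with, e.g., averaged fastText word vectors, yielding a much lower-dimensional dense representation while keeping the same downstream classifier.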
Pages: 603-621
Page count: 19
Related Papers
50 records in total
  • [41] Fuzzy Bag-of-Words Model for Document Representation
    Zhao, Rui
    Mao, Kezhi
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2018, 26 (02) : 794 - 804
  • [42] A Bag of Constrained Visual Words Model for Image Representation
    Mukherjee, Anindita
    Sil, Jaya
    Chowdhury, Ananda S.
    PROCEEDINGS OF 3RD INTERNATIONAL CONFERENCE ON COMPUTER VISION AND IMAGE PROCESSING, CVIP 2018, VOL 2, 2020, 1024 : 403 - 415
  • [43] The locally weighted bag of words framework for document representation
    Lebanon, Guy
    Mao, Yi
    Dillon, Joshua
    JOURNAL OF MACHINE LEARNING RESEARCH, 2007, 8 : 2405 - 2441
  • [44] Do Important Words in Bag-of-Words Model of Text Relatedness Help?
    Islam, Aminul
    Milios, Evangelos
    Keselj, Vlado
    TEXT, SPEECH, AND DIALOGUE (TSD 2015), 2015, 9302 : 569 - 577
  • [46] A Novel Codebook Representation Method and Encoding Strategy For Bag-of-Words Based Acoustic Event Classification
    Dai, Jia
    Ni, Chongjia
    Xue, Wei
    Liu, Wenju
    2015 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA), 2015, : 31 - 34
  • [47] Improving Text Classification with Word Embedding
    Ge, Lihao
    Moh, Teng-Sheng
    2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 1796 - 1805
  • [48] Adaptive Region Embedding for Text Classification
    Xiang, Liuyu
    Jin, Xiaoming
    Yi, Lan
    Ding, Guiguang
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 7314 - 7321
  • [49] In Defense of Word Embedding for Generic Text Representation
    Lev, Guy
    Klein, Benjamin
    Wolf, Lior
    NATURAL LANGUAGE PROCESSING AND INFORMATION SYSTEMS, NLDB 2015, 2015, 9103 : 35 - 50
  • [50] A Survey of Text Representation and Embedding Techniques in NLP
    Patil, Rajvardhan
    Boit, Sorio
    Gudivada, Venkat
    Nandigam, Jagadeesh
    IEEE ACCESS, 2023, 11 : 36120 - 36146