Bag of Words and Embedding Text Representation Methods for Medical Article Classification

Cited by: 1
Authors
Cichosz, Pawel [1]
Affiliations
[1] Warsaw Univ Technol, Inst Comp Sci, Nowowiejska 15-19, Warsaw, Poland
Keywords
text representation; text classification; bag of words; word embeddings; WORKLOAD
DOI
10.34768/amcs-2023-0043
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly from not necessarily large and usually imbalanced data with low computational demands, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefits of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
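The simplest pairing named in the abstract, a bag of words representation with a naive Bayes classifier, can be illustrated with a minimal sketch. This is not the author's implementation: the toy abstracts, the labels, the tokenizer, and the names `tokenize`, `bag_of_words`, and `MultinomialNaiveBayes` are assumptions made purely for illustration, and the real experiments use medical SLR datasets and standard library implementations.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, vocabulary):
    """Map a text to a term-count vector over a fixed vocabulary.

    Tokens outside the vocabulary are silently ignored.
    """
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


class MultinomialNaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n_terms = len(vectors[0])
        class_counts = Counter(labels)
        # Sum the term counts of all training documents per class.
        term_totals = {c: [0] * n_terms for c in self.classes}
        for vec, label in zip(vectors, labels):
            for i, count in enumerate(vec):
                term_totals[label][i] += count
        self.log_prior = {
            c: math.log(class_counts[c] / len(labels)) for c in self.classes
        }
        self.log_likelihood = {}
        for c in self.classes:
            total = sum(term_totals[c]) + n_terms  # add-one smoothing
            self.log_likelihood[c] = [
                math.log((t + 1) / total) for t in term_totals[c]
            ]
        return self

    def predict(self, vec):
        """Return the class with the highest log posterior score."""
        scores = {
            c: self.log_prior[c]
            + sum(n * ll for n, ll in zip(vec, self.log_likelihood[c]))
            for c in self.classes
        }
        return max(scores, key=scores.get)


# Hypothetical toy "abstracts" standing in for an SLR screening dataset.
train = [
    ("randomized controlled trial of statin therapy in patients", "relevant"),
    ("statin therapy reduces cholesterol in a randomized trial", "relevant"),
    ("survey of graph algorithms for network routing", "irrelevant"),
    ("deep learning for image segmentation benchmarks", "irrelevant"),
]

vocabulary = sorted({tok for text, _ in train for tok in tokenize(text)})
X = [bag_of_words(text, vocabulary) for text, _ in train]
y = [label for _, label in train]

model = MultinomialNaiveBayes().fit(X, y)
pred_a = model.predict(bag_of_words("a trial of statin therapy", vocabulary))
pred_b = model.predict(
    bag_of_words("graph algorithms for image segmentation", vocabulary)
)
print(pred_a, pred_b)
```

Note that `bag_of_words` drops any token missing from the training vocabulary, and that the vector length equals the vocabulary size. These are the two limitations that the abstract credits fastText embeddings with avoiding.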
Pages: 603-621
Number of pages: 19
Related papers
50 in total
  • [1] The influence of preprocessing on text classification using a bag-of-words representation
    HaCohen-Kerner, Yaakov
    Miller, Daniel
    Yigal, Yair
    PLOS ONE, 2020, 15 (05):
  • [2] A New Text Representation Scheme Combining Bag-of-Words and Bag-of-Concepts Approaches for Automatic Text Classification
    Alahmadi, Alaa
    Joorabchi, Arash
    Mahdi, Abdulhussain E.
    2013 7TH IEEE GCC CONFERENCE AND EXHIBITION (GCC), 2013, : 108 - 113
  • [3] Clinical Text Classification with Word Embedding Features vs. Bag-of-Words Features
    Shao, Yijun
    Taylor, Stephanie
    Marshall, Nell
    Morioka, Craig
    Zeng-Treitler, Qing
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 2874 - 2878
  • [4] EXPANDED BAG OF WORDS REPRESENTATION FOR OBJECT CLASSIFICATION
    Liu, Tinglin
    Liu, Jing
    Liu, Qinshan
    Lu, Hanqing
    2009 16TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, VOLS 1-6, 2009, : 297 - 300
  • [5] Towards Visual Words to Words Text Detection with a General Bag of Words Representation
    Mehta, Rakesh
    Chum, Ondrej
    Matas, Jiri
    2015 13TH IAPR INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION (ICDAR), 2015, : 641 - 645
  • [6] Beyond the bag of words: A text representation for sentence selection
    Caropreso, Maria Fernanda
    Matwin, Stan
    ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4013 : 324 - 335
  • [7] Joint Embedding of Words and Labels for Text Classification
    Wang, Guoyin
    Li, Chunyuan
    Wang, Wenlin
    Zhang, Yizhe
    Shen, Dinghan
    Zhang, Xinyuan
    Henao, Ricardo
    Carin, Lawrence
    PROCEEDINGS OF THE 56TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL), VOL 1, 2018, : 2321 - 2331
  • [8] Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers
    Frederico Dias Souza
    João Baptista de Oliveira e Souza Filho
    Neural Computing and Applications, 2023, 35 : 9393 - 9406
  • [9] Embedding generation for text classification of Brazilian Portuguese user reviews: from bag-of-words to transformers
    Souza, Frederico Dias
    Filho, Joao Baptista de Oliveira e Souza
    NEURAL COMPUTING & APPLICATIONS, 2023, 35 (13): : 9393 - 9406
  • [10] Albanian Text Classification: Bag of Words Model and Word Analogies
    Kadriu, Arbana
    Abazi, Lejla
    Abazi, Hyrije
    BUSINESS SYSTEMS RESEARCH JOURNAL, 2019, 10 (01): : 74 - 87