Bag of Words and Embedding Text Representation Methods for Medical Article Classification

Times cited: 1

Authors:

Cichosz, Pawel [1]

Affiliations:

[1] Warsaw Univ Technol, Inst Comp Sci, Nowowiejska 15-19, Warsaw, Poland

Source:

INTERNATIONAL JOURNAL OF APPLIED MATHEMATICS AND COMPUTER SCIENCE | 2023, Vol. 33, No. 4

Keywords:

text representation; text classification; bag of words; word embeddings; WORKLOAD

DOI:

10.34768/amcs-2023-0043

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Text classification has become a standard component of automated systematic literature review (SLR) solutions, where articles are classified as relevant or irrelevant to a particular literature study topic. Conventional machine learning algorithms for tabular data, which can learn quickly from not necessarily large and usually imbalanced data with low computational demands, are well suited to this application, but they require that the text data be transformed to a vector representation. This work investigates the utility of different types of text representations for this purpose. Experiments are presented using the bag of words representation and selected representations based on word or text embeddings: word2vec, doc2vec, GloVe, fastText, Flair, and BioBERT. Four classification algorithms are used with these representations: a naive Bayes classifier, logistic regression, support vector machines, and random forest. They are applied to datasets consisting of scientific article abstracts from systematic literature review studies in the medical domain and compared with the pre-trained BioBERT model fine-tuned for classification. The obtained results confirm that the choice of text representation is essential for successful text classification. It turns out that, while the standard bag of words representation is hard to beat, fastText word embeddings make it possible to achieve roughly the same level of classification quality with the added benefit of much lower dimensionality and the capability of handling out-of-vocabulary words. More refined embedding methods based on deep neural networks, while much more demanding computationally, do not appear to offer substantial advantages for the classification task. The fine-tuned BioBERT classification model performs on par with conventional algorithms when they are coupled with their best text representation methods.
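
The dimensionality contrast mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the paper's code: the two toy abstracts, the 2-dimensional `toy_embeddings` table, and every function name are hypothetical, and the averaging step is only a generic stand-in for embedding-based representations. In particular, it does not reproduce the subword mechanism that lets fastText handle out-of-vocabulary words.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map each distinct corpus token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Term-count vector with one dimension per vocabulary word; unseen words are dropped."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


def mean_embedding(text, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [embeddings[tok] for tok in tokenize(text) if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical toy data, invented for illustration only.
abstracts = [
    "Statin therapy reduces cardiovascular risk",
    "Deep learning for image segmentation",
]
toy_embeddings = {
    "statin": [1.0, 0.0],
    "therapy": [0.8, 0.2],
    "learning": [0.0, 1.0],
    "segmentation": [0.2, 0.8],
}

vocab = build_vocabulary(abstracts)
bow = bag_of_words(abstracts[0], vocab)
emb = mean_embedding(abstracts[0], toy_embeddings, dim=2)

# The bag-of-words vector grows with the vocabulary; the embedding vector stays at `dim`.
print(len(bow), len(emb))
```

Either vector could then be passed to a conventional classifier such as logistic regression. In this sketch the bag-of-words length is tied to the corpus vocabulary, while the averaged embedding keeps a fixed, much smaller size.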

Pages: 603-621

Page count: 19
