Tokenization-based data augmentation for text classification

Cited by: 1

Authors:

Prakrankamanant, Patawee [1]

Chuangsuwanich, Ekapol [1]

Affiliations:

[1] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand

Source:

2022 19TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING (JCSSE 2022) | 2022

DOI:

10.1109/JCSSE54890.2022.9836268

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

Tokenization is one of the most important preprocessing steps in text classification and a major contributing factor in model performance. However, obtaining good tokenizations is non-trivial when the input is noisy, and it is especially problematic for languages without an explicit word delimiter, such as Thai. We therefore propose an alternative data augmentation method that improves robustness to poor tokenization by using multiple tokenizations. We evaluate our algorithms on several Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and combines well with other data augmentation schemes.
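The general idea of augmenting with multiple tokenizations can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the alternative tokenizations are produced, so the toy dictionary-based segmenter, the function names `all_segmentations` and `augment`, and the parameter `k` are all assumptions made for illustration. The sketch enumerates every way to split an unspaced string into vocabulary words and emits up to `k` of those segmentations as separate training examples that share the original label.

```python
import random
from typing import List, Set, Tuple


def all_segmentations(text: str, vocab: Set[str], max_len: int = 10) -> List[List[str]]:
    """Enumerate every split of `text` into words found in `vocab` (toy segmenter)."""
    if not text:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for end in range(1, min(len(text), max_len) + 1):
        head = text[:end]
        if head in vocab:
            # Recurse on the remainder and prepend the matched word.
            for tail in all_segmentations(text[end:], vocab, max_len):
                results.append([head] + tail)
    return results


def augment(
    examples: List[Tuple[str, str]],
    vocab: Set[str],
    k: int = 2,
    seed: int = 0,
) -> List[Tuple[List[str], str]]:
    """Return up to `k` distinct tokenizations per example, each with the same label."""
    rng = random.Random(seed)
    augmented = []
    for text, label in examples:
        candidates = all_segmentations(text, vocab)
        if not candidates:
            # Hypothetical fallback: character-level tokens when no split exists.
            candidates = [list(text)]
        for tokens in rng.sample(candidates, min(k, len(candidates))):
            augmented.append((tokens, label))
    return augmented


# Illustrative data only; an unspaced English string stands in for Thai text.
toy_vocab = {"ice", "cream", "icecream"}
toy_data = [("icecream", "food")]
augmented_data = augment(toy_data, toy_vocab, k=2)
```

In practice the candidate tokenizations would more plausibly come from several real tokenizers or from a sampling segmenter than from exhaustive enumeration, which grows exponentially with text length.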


Pages: 6

