Tokenization-based data augmentation for text classification

Cited by: 1

Authors:

Prakrankamanant, Patawee [1]

Chuangsuwanich, Ekapol [1]

Affiliations:

[1] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand

Source:

2022 19th International Joint Conference on Computer Science and Software Engineering (JCSSE 2022) | 2022

DOI:

10.1109/JCSSE54890.2022.9836268

Chinese Library Classification (CLC) number:

TP31 [Computer Software]

Discipline classification codes:

081202; 0835

Abstract:

Tokenization is one of the most important data preprocessing steps in text classification and one of the main factors contributing to model performance. However, obtaining a good tokenization is non-trivial when the input is noisy, and it is especially problematic for languages without an explicit word delimiter, such as Thai. We therefore propose an alternative data augmentation method that uses multiple tokenizations to improve robustness to poor tokenization. We evaluate our algorithms on several Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and combines well with other data augmentation schemes.
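
The abstract does not say how the alternative tokenizations are produced or how they enter training, so the following is only a minimal sketch of the general idea under stated assumptions: each distinct tokenization of a text becomes one extra training example carrying the original label. The toy tokenizers and the names `char_tokenize`, `make_chunk_tokenizer`, and `augment_with_tokenizations` are hypothetical stand-ins for real Thai word segmenters and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Tokenizer = Callable[[str], List[str]]


def char_tokenize(text: str) -> List[str]:
    """Toy tokenizer (assumption): one token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def make_chunk_tokenizer(size: int) -> Tokenizer:
    """Toy tokenizer factory (assumption): fixed-length character chunks."""
    def tokenize(text: str) -> List[str]:
        compact = "".join(text.split())  # drop all whitespace first
        return [compact[i:i + size] for i in range(0, len(compact), size)]
    return tokenize


def augment_with_tokenizations(
    examples: Sequence[Tuple[str, str]],
    tokenizers: Sequence[Tokenizer],
) -> List[Tuple[List[str], str]]:
    """Emit one (tokens, label) pair per distinct tokenization of each text."""
    augmented: List[Tuple[List[str], str]] = []
    for text, label in examples:
        seen = set()
        for tokenize in tokenizers:
            tokens = tuple(tokenize(text))
            # Skip empty results and tokenizations identical to an earlier one.
            if tokens and tokens not in seen:
                seen.add(tokens)
                augmented.append((list(tokens), label))
    return augmented


# Hypothetical tokenizer pool; a real setup would use several word segmenters.
TOKENIZERS = [char_tokenize, make_chunk_tokenizer(2), make_chunk_tokenizer(3)]

if __name__ == "__main__":
    train = [("อาหารอร่อย", "positive")]
    for tokens, label in augment_with_tokenizations(train, TOKENIZERS):
        print(label, tokens)
```

Deduplicating identical tokenizations keeps short texts, on which all tokenizers agree, from being over-represented in the augmented set.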

Pages: 6
