Tokenization-based data augmentation for text classification

Authors
Prakrankamanant, Patawee [1]
Chuangsuwanich, Ekapol [1]
Affiliations
[1] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
DOI
10.1109/JCSSE54890.2022.9836268
Chinese Library Classification
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Tokenization is one of the most important data preprocessing steps in text classification and one of the main factors contributing to model performance. However, obtaining good tokenizations is non-trivial when the input is noisy, and is especially problematic for languages without an explicit word delimiter, such as Thai. We therefore propose an alternative data augmentation method that improves robustness to poor tokenization by using multiple tokenizations. We evaluate the performance of our algorithms on different Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and combines well with other data augmentation schemes.
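The abstract does not specify the algorithm, so the following is only a minimal sketch of the general idea of augmenting with multiple tokenizations, not the authors' method. It assumes a toy dictionary-based segmenter; the names `all_segmentations` and `augment`, the vocabulary, and the character-level fallback are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple


def all_segmentations(text: str, vocab: Set[str], max_results: int = 5) -> List[List[str]]:
    """Enumerate up to max_results ways to split text into words from vocab.

    Hypothetical helper: a plain depth-first search, not a real Thai tokenizer.
    """
    results: List[List[str]] = []

    def walk(start: int, path: List[str]) -> None:
        if len(results) >= max_results:
            return
        if start == len(text):
            results.append(list(path))
            return
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in vocab:
                path.append(piece)
                walk(end, path)
                path.pop()

    walk(0, [])
    return results


def augment(
    dataset: List[Tuple[str, int]], vocab: Set[str], max_variants: int = 3
) -> List[Tuple[List[str], int]]:
    """Emit one (tokens, label) training example per alternative tokenization."""
    augmented: List[Tuple[List[str], int]] = []
    for text, label in dataset:
        variants = all_segmentations(text, vocab, max_variants)
        if not variants:
            # Assumption: fall back to character tokens when no split is found.
            variants = [list(text)]
        for tokens in variants:
            augmented.append((tokens, label))
    return augmented


# A classic ambiguous Thai string: "ตา กลม" (round eyes) vs "ตาก ลม" (exposed to wind).
thai_vocab = {"ตา", "กลม", "ตาก", "ลม"}
thai_examples = augment([("ตากลม", 1)], thai_vocab)
# thai_examples == [(["ตา", "กลม"], 1), (["ตาก", "ลม"], 1)]
```

Each alternative segmentation keeps the original label, so a classifier trained on the augmented set sees the same text under several plausible token boundaries. In practice the variants would more likely come from different tokenizers or sampled segmentations than from exhaustive enumeration.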
Pages: 6