Tokenization-based data augmentation for text classification

Cited by: 1
Authors
Prakrankamanant, Patawee [1]
Chuangsuwanich, Ekapol [1]
Affiliations
[1] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
DOI
10.1109/JCSSE54890.2022.9836268
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Tokenization is one of the most important preprocessing steps in text classification and a major factor in model performance. However, obtaining good tokenizations is non-trivial when the input is noisy, and it is especially problematic for languages without an explicit word delimiter, such as Thai. We therefore propose an alternative data augmentation method that uses multiple tokenizations of the same text to improve robustness to poor tokenization. We evaluate our algorithms on several Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and combines well with other data augmentation schemes.
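The general idea of augmenting with multiple tokenizations can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which tokenizers or sampling strategy the paper uses. The helper names `dict_tokenize`, `char_tokenize`, and `augment`, the toy vocabulary, and the unsegmented English string standing in for text without word delimiters are all assumptions made for illustration.

```python
from typing import Callable, List, Set, Tuple

Tokenizer = Callable[[str], List[str]]


def dict_tokenize(text: str, vocab: Set[str], prefer_longest: bool = True) -> List[str]:
    """Greedy dictionary segmentation (hypothetical helper).

    At each position, take the longest (or shortest) vocabulary entry that
    matches; a character with no match becomes a single-character token.
    """
    max_len = max(map(len, vocab))
    tokens: List[str] = []
    i = 0
    while i < len(text):
        lengths = range(1, min(max_len, len(text) - i) + 1)
        matches = [n for n in lengths if text[i:i + n] in vocab]
        if matches:
            n = max(matches) if prefer_longest else min(matches)
        else:
            n = 1  # unknown character: fall back to a one-character token
        tokens.append(text[i:i + n])
        i += n
    return tokens


def char_tokenize(text: str) -> List[str]:
    """Character-level segmentation, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def augment(
    dataset: List[Tuple[str, str]], tokenizers: List[Tokenizer]
) -> List[Tuple[List[str], str]]:
    """Emit one training sample per distinct tokenization of each text.

    The label is copied unchanged, so the classifier sees the same example
    segmented in several ways.
    """
    augmented: List[Tuple[List[str], str]] = []
    for text, label in dataset:
        seen = set()
        for tokenize in tokenizers:
            tokens = tuple(tokenize(text))
            if tokens not in seen:  # skip duplicate segmentations
                seen.add(tokens)
                augmented.append((list(tokens), label))
    return augmented


# Toy vocabulary and input (assumptions, not data from the paper).
VOCAB = {"the", "cat", "cats", "sat", "at"}
TOKENIZERS: List[Tokenizer] = [
    lambda t: dict_tokenize(t, VOCAB, prefer_longest=True),
    lambda t: dict_tokenize(t, VOCAB, prefer_longest=False),
    char_tokenize,
]

for tokens, label in augment([("thecatsat", "positive")], TOKENIZERS):
    print(label, tokens)
```

In this toy case the longest-match and shortest-match segmenters disagree on where the word boundaries fall, which is the kind of tokenization ambiguity the abstract describes; training on both variants exposes the model to each.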
Pages: 6