Tokenization-based data augmentation for text classification

Cited by: 1
Authors
Prakrankamanant, Patawee [1]
Chuangsuwanich, Ekapol [1]
Affiliation
[1] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
DOI
10.1109/JCSSE54890.2022.9836268
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Tokenization is one of the most important data preprocessing steps in text classification and one of the main factors contributing to model performance. However, obtaining good tokenizations is non-trivial when the input is noisy, and is especially problematic for languages without an explicit word delimiter, such as Thai. Therefore, we propose an alternative data augmentation method that improves robustness to poor tokenization by using multiple tokenizations. We evaluate the performance of our algorithms on different Thai text classification datasets. The results suggest that our augmentation scheme makes the model more robust to tokenization errors and can be combined well with other data augmentation schemes.
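The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of augmenting with multiple tokenizations: each labeled text is segmented by several tokenizers, and every distinct segmentation becomes its own training example with the same label. All names here (`make_longest_match_tokenizer`, `augment_with_tokenizations`, the toy vocabularies) are hypothetical illustrations, not the authors' implementation, and the dictionary-based longest-match segmenter merely stands in for real word segmenters.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Tokenizer = Callable[[str], List[str]]


def make_longest_match_tokenizer(vocab: Sequence[str]) -> Tokenizer:
    """Build a toy greedy longest-match segmenter (hypothetical stand-in)."""
    # Longest words first so the greedy scan prefers longer matches;
    # empty strings are dropped to guarantee progress.
    words = sorted({w for w in vocab if w}, key=len, reverse=True)

    def tokenize(text: str) -> List[str]:
        tokens: List[str] = []
        i = 0
        while i < len(text):
            # Fall back to a single character when no vocabulary word matches.
            match = next((w for w in words if text.startswith(w, i)), text[i])
            tokens.append(match)
            i += len(match)
        return tokens

    return tokenize


def augment_with_tokenizations(
    examples: Iterable[Tuple[str, int]],
    tokenizers: Sequence[Tokenizer],
) -> List[Tuple[List[str], int]]:
    """Emit one training example per distinct tokenization of each text."""
    augmented: List[Tuple[List[str], int]] = []
    for text, label in examples:
        seen = set()
        for tokenize in tokenizers:
            tokens = tuple(tokenize(text))
            if tokens not in seen:  # skip segmentations already emitted
                seen.add(tokens)
                augmented.append((list(tokens), label))
    return augmented


# Two toy segmenters that disagree on where the word boundaries fall.
tok_a = make_longest_match_tokenizer(["abc", "ab", "d"])
tok_b = make_longest_match_tokenizer(["ab", "cd"])
train = augment_with_tokenizations([("abcd", 1)], [tok_a, tok_b])
```

In this sketch the classifier would see both segmentations of the same text during training, which is one plausible way to make it less sensitive to the choice of tokenizer at test time. Whether segmentations are deduplicated, sampled, or weighted is a design choice the abstract does not specify.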
Pages: 6
Related papers
50 items in total
  • [41] Robust scientific text classification using prompt tuning based on data augmentation with L2 regularization
    Shi, Shijun
    Hu, Kai
    Xie, Jie
    Guo, Ya
    Wu, Huayi
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2024, 61 (01)
  • [42] Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification
    Ren, Shuhuai
    Zhang, Jinchao
    Li, Lei
    Sun, Xu
    Zhou, Jie
    [J]. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 9029 - 9043
  • [43] Language Model Data Augmentation Based on Text Domain Transfer
    Ogawa, Atsunori
    Tawara, Naohiro
    Delcroix, Marc
    [J]. INTERSPEECH 2020, 2020, : 4926 - 4930
  • [44] Improving Utterance Rewriter Based on MMI and Text Data Augmentation
    Yang, Lina
    Lin, Hai
    Li, Wei
    Meng, Zuqiang
    Wang, Patrick Shen-Pei
    Li, Xichun
    Luo, Huiwu
    [J]. INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2022, 36 (04)
  • [45] Text Summarization Based on Conceptual Data Classification
    AlJa'am, Jihad M.
    Jaoua, Ali M.
    Hasnah, Ahmad M.
    Hassan, F.
    Mohamed, H.
    Mosaid, T.
    Saleh, H.
    Abdullah, F.
    Cherif, H.
    [J]. INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY AND WEB ENGINEERING, 2006, 1 (04) : 22 - 36
  • [46] Feature-based augmentation and classification for tabular data
    Sathianarayanan, Balachander
    Samant, Yogesh Chandra Singh
    Guruprasad, Prahalad S. Conjeepuram
    Hariharan, Varshin B.
    Manickam, Nirmala Devi
    [J]. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2022, 7 (03) : 481 - 491
  • [47] CycleGAN Based Data Augmentation For Melanoma images Classification
    Chen, Yixin
    Zhu, Yifan
    Chang, Yanfeng
    [J]. AIPR 2020: 2020 3RD INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND PATTERN RECOGNITION, 2020, : 115 - 119
  • [48] Data Augmentation Based on DiscrimDiff for Histopathology Image Classification
    Guan, Xianchao
    Wang, Yifeng
    Lin, Yiyang
    Zhang, Yongbing
    [J]. DATA AUGMENTATION, LABELLING, AND IMPERFECTIONS, DALI 2023, 2024, 14379 : 53 - 62
  • [49] Morpheme Matching Based Text Tokenization for a Scarce Resourced Language
    Rehman, Zobia
    Anwar, Waqas
    Bajwa, Usama Ijaz
    Wang Xuan
    Zhou Chaoying
    [J]. PLOS ONE, 2013, 8 (08)
  • [50] Automatic modulation classification based on AlexNet with data augmentation
    Chengchang, Zhang
    Yu, Xu
    Jianpeng, Yang
    Xiaomeng, Li
    [J]. Journal of China Universities of Posts and Telecommunications, 2022, 29 (05) : 51 - 61