Enhancing Text Classification Models with Generative AI-aided Data Augmentation

Cited by: 1
Authors
Zhao, Huanhuan [1]
Chen, Haihua [2]
Yoon, Hong-Jun [3]
Affiliations
[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA
[2] Univ North Texas, Dept Informat Sci, Denton, TX USA
[3] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN USA
Keywords
text classification; data augmentation; ChatGPT; imbalanced data; natural language processing; machine learning; artificial intelligence
DOI
10.1109/AITest58265.2023.00030
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study investigated whether text classification performance can be improved by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. Experiments were conducted on three models (CNN, HiSAN, and BERT) using the Reuters dataset. The study first evaluated the effectiveness of incorporating ChatGPT-generated samples, then analyzed how sample size, sample variability, and sample length affect model performance by varying the number, variety, and length of the generated samples. The models were assessed using macro- and micro-averaged scores. The macro-averaged scores improved significantly across all three models, with BERT showing the greatest improvement (macro F1 score from 49.87% to 65.73%). Adding 30 distinct samples produced better results than adding 6 duplicates of 5 samples. Samples of 150 and 256 words performed similarly, while samples of 50 words performed slightly worse. These findings suggest that incorporating external knowledge samples generated by a generative AI is an effective way to enhance text classification performance. They also indicate that the variability of the articles generated by ChatGPT positively affected the models' accuracy, and that longer synthesized texts convey more comprehensive information on the subjects, leading to higher classification accuracy scores. Additionally, the results were compared with those obtained from EDA, a widely used data augmentation technique. The comparison shows that the proposed method outperforms EDA and offers additional advantages by reducing computational costs and addressing the zero-shot problem. The code is available on GitHub.
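The abstract reports both macro- and micro-averaged scores, and the difference between them matters on imbalanced data. The following is a minimal sketch, not the authors' evaluation code: it assumes a single-label setting, and the function name `f1_scores` and the toy labels are hypothetical. It shows how a classifier that ignores a rare class can keep a high micro F1 while its macro F1 drops, which is consistent with augmentation of under-represented classes showing up mainly in macro-averaged scores.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, so frequent classes dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: 8 documents of a frequent class, 2 of a rare class.
y_true = ["frequent"] * 8 + ["rare"] * 2
y_pred = ["frequent"] * 10  # a classifier that never predicts the rare class

macro, micro = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In this toy case the rare class contributes an F1 of zero to the macro average, while the pooled micro score is barely affected by it.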
Pages: 138-145
Page count: 8