Enhancing Text Classification Models with Generative AI-aided Data Augmentation

Cited by: 1
Authors
Zhao, Huanhuan [1]
Chen, Haihua [2]
Yoon, Hong-Jun [3]
Affiliations
[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA
[2] Univ North Texas, Dept Informat Sci, Denton, TX USA
[3] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN USA
Keywords
text classification; data augmentation; ChatGPT; imbalanced data; natural language processing; machine learning; artificial intelligence
DOI
10.1109/AITest58265.2023.00030
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
This study investigated whether text classification performance can be improved by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. Experiments were conducted on three models (CNN, HiSAN, and BERT) using the Reuters dataset. The study first evaluated the effectiveness of incorporating ChatGPT-generated samples, then analyzed the impact of sample size, sample variability, and sample length on model performance by varying the number, variety, and length of the generated samples. The models were assessed using macro- and micro-averaged scores. The macro-averaged scores improved significantly across all three models, with the BERT model showing the greatest improvement (from 49.87% to 65.73% in macro F1 score). Adding 30 distinct samples produced better results than adding 6 duplicates of 5 samples, and samples of 150 and 256 words performed similarly, while those of 50 words performed slightly worse. These findings suggest that incorporating external knowledge samples generated by a generative AI is an effective way to enhance the performance of text classification models. The study also highlights that the variability of the articles generated by ChatGPT positively affected the models' accuracy, and that longer synthesized texts convey more comprehensive information on the subjects, leading to higher classification accuracy scores. Additionally, we compared our results with those obtained from EDA, a widely used data augmentation method. The comparison shows that our method outperforms EDA and offers additional advantages by reducing computational costs and addressing the zero-shot problem. Our code is available on GitHub.
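The abstract reports both macro- and micro-averaged scores, and the gains appear mainly in the macro average. The sketch below is illustrative only and is not the authors' evaluation code: it assumes a single-label setting, and the class names and toy predictions are hypothetical. It shows why macro F1 is the more sensitive measure on imbalanced data, since every class counts equally regardless of its size.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical imbalanced toy set: 8 "earn" documents and 2 "cocoa" documents.
# The classifier ignores the minority class and always predicts "earn".
y_true = ["earn"] * 8 + ["cocoa"] * 2
y_pred = ["earn"] * 10

macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In this toy case the micro average stays high because the majority class dominates the pooled counts, while the macro average is pulled down by the minority class that is never predicted. Adding training samples for rare classes would therefore be expected to move the macro score more than the micro score.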
Pages: 138-145
Page count: 8