Enhancing Text Classification Models with Generative AI-aided Data Augmentation

Cited by: 1
Authors
Zhao, Huanhuan [1]
Chen, Haihua [2]
Yoon, Hong-Jun [3]
Affiliations
[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA
[2] Univ North Texas, Dept Informat Sci, Denton, TX USA
[3] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN USA
Keywords
text classification; data augmentation; ChatGPT; imbalanced data; natural language processing; machine learning; artificial intelligence
DOI
10.1109/AITest58265.2023.00030
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This study investigated whether text classification performance can be improved by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. Experiments were conducted on three models (CNN, HiSAN, and BERT) using the Reuters dataset. The study first evaluated the effectiveness of incorporating ChatGPT-generated samples, then analyzed how sample size, sample variability, and sample length affect model performance by varying the number, variety, and length of the generated samples. The models were assessed using macro- and micro-averaged scores. The macro-averaged scores improved significantly across all three models, with BERT showing the greatest improvement (macro F1 score from 49.87% to 65.73%). Adding 30 distinct samples produced better results than adding 6 duplicates of 5 samples. Samples of 150 and 256 words performed similarly, while samples of 50 words performed slightly worse. These findings suggest that incorporating external knowledge samples generated by a generative AI is an effective way to improve text classification models. The study also highlights that the variability of the articles generated by ChatGPT had a positive effect on model accuracy, and that longer synthesized texts convey more comprehensive information on their subjects, leading to higher classification accuracy. The results were additionally compared with those obtained from EDA, a widely used data augmentation technique. The proposed method outperformed EDA and offers further advantages: it reduces computational costs and addresses the zero-shot problem. The code is available on GitHub.
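The abstract reports a large gain in macro-averaged scores. The following is a minimal sketch, not the authors' code, of why the macro-averaged F1 reacts more strongly than the micro-averaged F1 when a rare class starts being recognized. It assumes a simplified single-label setting; the class names `common` and `rare`, the function `f1_scores`, and all predictions are hypothetical toy values, not results from the paper.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    # Macro: average the per-class F1 values, so every class counts equally.
    per_class = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        per_class.append(2 * tp[c] / denom if denom else 0.0)
    macro = sum(per_class) / len(per_class)

    # Micro: pool the counts over all classes, so frequent classes dominate.
    total_tp = sum(tp.values())
    denom = 2 * total_tp + sum(fp.values()) + sum(fn.values())
    micro = 2 * total_tp / denom if denom else 0.0
    return macro, micro


# Hypothetical imbalanced test set: 8 "common" documents, 2 "rare" ones.
y_true = ["common"] * 8 + ["rare"] * 2

# Hypothetical baseline that never predicts the rare class.
y_base = ["common"] * 10
# Hypothetical model after augmentation: one rare document is now recognized.
y_aug = ["common"] * 8 + ["rare", "common"]

macro_base, micro_base = f1_scores(y_true, y_base)
macro_aug, micro_aug = f1_scores(y_true, y_aug)
print(f"baseline:  macro={macro_base:.3f} micro={micro_base:.3f}")
print(f"augmented: macro={macro_aug:.3f} micro={micro_aug:.3f}")
```

In this toy case, recognizing a single rare-class document raises the macro score far more than the micro score. This matches the general reason macro averaging is used to judge performance on imbalanced data.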
Pages: 138-145
Number of pages: 8