Enhancing Text Classification Models with Generative AI-aided Data Augmentation

Cited by: 1

Authors:

Zhao, Huanhuan [1]

Chen, Haihua [2]

Yoon, Hong-Jun [3]

Affiliations:

[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA

[2] Univ North Texas, Dept Informat Sci, Denton, TX USA

[3] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN USA

Source:

2023 IEEE INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE TESTING, AITEST | 2023

Keywords:

text classification; data augmentation; ChatGPT; imbalanced data; natural language processing; machine learning; artificial intelligence;

DOI:

10.1109/AITest58265.2023.00030

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This study investigated the potential of enhancing the performance of text classification by augmenting the training dataset with external knowledge samples generated by a generative AI, specifically ChatGPT. The study conducted experiments on three models (CNN, HiSAN, and BERT) using the Reuters dataset. First, the study evaluated the effectiveness of incorporating ChatGPT-generated samples and then analyzed the impact of sample size, sample variability, and sample length on the models' performance by varying the number, variety, and length of the generated samples. The models were assessed using macro- and micro-averaged scores, and the results revealed that the macro-averaged scores improved significantly across all three models, with the BERT model showing the greatest improvement (from 49.87% to 65.73% in macro F1 score). The study further found that adding 30 distinct samples produced better results than adding 6 duplicates of 5 samples, and that samples with 150 and 256 words had similar performance, while those with 50 words performed slightly worse. These findings suggest that incorporating external knowledge samples generated by a generative AI is an effective approach to enhancing the performance of text classification models. The study also highlights that the variability of articles generated by ChatGPT positively impacted the models' accuracy, and that longer synthesized texts convey more comprehensive information on the subjects, leading to higher classification accuracy scores. Additionally, we conducted a comparison between our results and those obtained from EDA, a widely used data augmentation generator. The findings clearly demonstrate that our method surpasses EDA and offers additional advantages by reducing computational costs and addressing the zero-shot problem. Our code is available on GitHub.
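
The abstract reports both macro- and micro-averaged scores, and the gap between the two matters on imbalanced data. The following is a minimal sketch of that distinction, not the authors' evaluation code: the function name `f1_scores`, the labels, and the toy predictions are all hypothetical, and it assumes single-label classification for simplicity.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy data: 8 documents of a common class, 2 of a rare class,
# and a classifier that always predicts the common class.
y_true = ["common"] * 8 + ["rare"] * 2
y_pred = ["common"] * 10
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.4f}, macro F1 = {macro:.4f}")
```

In this toy case the micro-averaged score stays high because the common class dominates the pooled counts, while the macro-averaged score drops because the missed rare class counts equally. This is one reason macro scores are the more sensitive measure when augmentation targets under-represented classes.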

Pages: 138-145

Page count: 8

50 items in total

[31] Probabilistic Interpolation with Mixup Data Augmentation for Text Classification
Xu, Rongkang
Zhang, Yongcheng
Ren, Kai
Huang, Yu
Wei, Xiaomei
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT IV, ICIC 2024, 2024, 14878 : 410 - 421
[32] AEDA: An Easier Data Augmentation Technique for Text Classification
Karimi, Akbar
Rossi, Leonardo
Prati, Andrea
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 2748 - 2754
[33] Cancer classification with data augmentation based on generative adversarial networks
Wei, Kaimin
Li, Tianqi
Huang, Feiran
Chen, Jinpeng
He, Zefan
FRONTIERS OF COMPUTER SCIENCE, 2022, 16 (02)
[34] Cancer classification with data augmentation based on generative adversarial networks
Kaimin WEI
Tianqi LI
Feiran HUANG
Jinpeng CHEN
Zefan HE
Frontiers of Computer Science, 2022, 16 (02) : 69 - 79
[35] Cancer classification with data augmentation based on generative adversarial networks
Kaimin Wei
Tianqi Li
Feiran Huang
Jinpeng Chen
Zefan He
Frontiers of Computer Science, 2022, 16
[36] Generative Model based Data Augmentation for Special Person Classification
Guo, Zijie
Zhi, Rong
Zhang, Wuqaing
Wang, Baofeng
Fang, Zhijie
Kaiser, Vitali
Wiederer, Julian
Flohr, Fabian
2020 IEEE INTELLIGENT VEHICLES SYMPOSIUM (IV), 2020, : 1669 - 1675
[37] AI-Aided Analyses of Seizure and Interictal Phenotypes and Drug Responses in Epilepsy Models: Possibilities for Clinical Applications
Soltesz, Ivan
ANNALS OF NEUROLOGY, 2024, 96 : S303 - S303
[38] A Generative Data Augmentation Model for Enhancing Chinese Dialect Pronunciation Prediction
Lin, Chu-Cheng
Tsai, Richard Tzong-Han
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (04): : 1109 - 1117
[39] Generative dynamical models for classification of rsfMRI data
Huckins, Grace
Poldrack, Russell A.
NETWORK NEUROSCIENCE, 2024, 8 (04): : 1613 - 1633
[40] Enhancing X-ray Security Image Synthesis: Advanced Generative Models and Innovative Data Augmentation Techniques
Yagoub, Bilel
Kasem, Mahmoud SalahEldin
Kang, Hyun-Soo
APPLIED SCIENCES-BASEL, 2024, 14 (10):
