Improving Text Classification with Large Language Model-Based Data Augmentation

Cited by: 2
Authors
Zhao, Huanhuan [1]
Chen, Haihua [2]
Ruggles, Thomas A. [3]
Feng, Yunhe [4]
Singh, Debjani [3]
Yoon, Hong-Jun [5]
Affiliations
[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA
[2] Univ North Texas, Dept Informat Sci, Denton, TX 76203 USA
[3] Oak Ridge Natl Lab, Environm Sci Div, Oak Ridge, TN 37830 USA
[4] Univ North Texas, Computat Sci & Engn, Denton, TX 76203 USA
[5] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN 37830 USA
Keywords
data augmentation; large language model; ChatGPT; imbalanced data; text classification; natural language processing; machine learning; artificial intelligence
DOI
10.3390/electronics13132535
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch. However, it is unclear which method is better, because their effectiveness has not been compared directly. This study investigates the application of both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently improved the model's classification results on both datasets. 2. Generating new data generally outperforms rewriting existing data, though careful prompt design is crucial to extract the most valuable information from ChatGPT, particularly for domain-specific data. 3. The size of the augmentation data affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples. 4. Combining rewritten samples with newly generated samples can potentially further improve the model's performance.
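The two augmentation strategies compared in the abstract (rewriting existing samples versus generating new ones) and their combination can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the function names `rewrite_prompt`, `generate_prompt`, and `augment`, and the `llm` callable are all hypothetical, introduced only for illustration. `llm` stands for any function that takes a prompt string and returns text, for example a thin wrapper around a chat model.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def rewrite_prompt(text: str, label: str) -> str:
    """Hypothetical prompt asking the model to paraphrase an existing sample."""
    return (
        f"Rewrite the following text so that it keeps its meaning "
        f"and still belongs to the category '{label}':\n{text}"
    )


def generate_prompt(label: str, examples: List[str]) -> str:
    """Hypothetical prompt asking the model for a brand-new sample."""
    shots = "\n".join(f"- {e}" for e in examples)
    return (
        f"Write one new text that belongs to the category '{label}'. "
        f"Example texts from this category:\n{shots}"
    )


def augment(
    samples: List[Sample],
    label: str,
    llm: Callable[[str], str],
    n_new: int,
    examples_per_prompt: int = 3,
) -> List[Sample]:
    """Return rewritten samples followed by n_new generated samples for one class."""
    texts = [text for text, lab in samples if lab == label]
    # Strategy 1: rewrite every existing sample of the target class.
    rewritten = [(llm(rewrite_prompt(t, label)), label) for t in texts]
    # Strategy 2: generate new samples, showing a few existing ones as examples.
    shots = texts[:examples_per_prompt]
    generated = [(llm(generate_prompt(label, shots)), label) for _ in range(n_new)]
    # Combination: keep both sets, as in the abstract's fourth finding.
    return rewritten + generated
```

The returned samples would be appended to the training set of the minority class before training the classifier. Passing the model as a plain callable keeps the sketch independent of any particular API client; `n_new` corresponds to the augmentation size whose effect the abstract reports.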
Pages: 14