Improving Text Classification with Large Language Model-Based Data Augmentation

Cited by: 2
Authors
Zhao, Huanhuan [1 ]
Chen, Haihua [2 ]
Ruggles, Thomas A. [3 ]
Feng, Yunhe [4 ]
Singh, Debjani [3 ]
Yoon, Hong-Jun [5 ]
Affiliations
[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA
[2] Univ North Texas, Dept Informat Sci, Denton, TX 76203 USA
[3] Oak Ridge Natl Lab, Environm Sci Div, Oak Ridge, TN 37830 USA
[4] Univ North Texas, Computat Sci & Engn, Denton, TX 76203 USA
[5] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN 37830 USA
Keywords
data augmentation; large language model; ChatGPT; imbalanced data; text classification; natural language processing; machine learning; artificial intelligence
DOI
10.3390/electronics13132535
CLC number
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) by either rewriting the existing dataset with ChatGPT or generating entirely new data from scratch, but without a direct comparison it is unclear which method is better. This study applies both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: (1) new data generated by ChatGPT consistently enhanced the model's classification results on both datasets; (2) generating new data generally outperforms rewriting existing data, though prompts must be crafted carefully to extract the most valuable information from ChatGPT, particularly for domain-specific data; (3) the size of the augmentation data affects the effectiveness of DA, but we observed a plateau after incorporating 10 samples; and (4) combining rewritten samples with newly generated samples can further improve the model's performance.
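The two augmentation strategies compared in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `llm` callable (standing in for a ChatGPT-style completion call), and the helper names are all assumptions made for clarity.

```python
def rewrite_prompt(text: str) -> str:
    # Strategy 1: ask the model to paraphrase an existing labeled sample,
    # preserving its meaning so the original label still applies.
    return ("Rewrite the following text, preserving its meaning and label:\n"
            + text)

def generate_prompt(label: str) -> str:
    # Strategy 2: ask the model to write an entirely new sample for a class
    # from scratch, conditioned only on the class label.
    return f"Write a short news article about the topic '{label}'."

def augment(dataset, llm, n_new=10):
    """Produce augmented samples with both strategies.

    dataset: list of (text, label) pairs.
    llm: callable mapping a prompt string to generated text.
    n_new: new samples per class; the paper reports gains plateauing
           after incorporating about 10 samples.
    """
    augmented = []
    # Rewrite each existing sample once.
    for text, label in dataset:
        augmented.append((llm(rewrite_prompt(text)), label))
    # Generate n_new fresh samples per class.
    for label in {lbl for _, lbl in dataset}:
        for _ in range(n_new):
            augmented.append((llm(generate_prompt(label)), label))
    return augmented
```

In practice `llm` would wrap an API call to ChatGPT; the paper's fourth finding suggests keeping both the rewritten and the newly generated pools in the training set rather than choosing one.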
Pages: 14