Data augmentation strategies to improve text classification: a use case in smart cities

Cited by: 0
Authors
Bencke, Luciana [1]
Moreira, Viviane Pereira [1]
Affiliations
[1] Federal University of Rio Grande do Sul (UFRGS), Institute of Informatics, Porto Alegre, RS, Brazil
Keywords
Data augmentation; Text classification; Low-resources; Smart cities
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Text classification is a common and important task in Natural Language Processing. In many domains and real-world settings, a few labeled instances are the only resource available to train classifiers. Models trained on small datasets tend to overfit and produce inaccurate results. Data augmentation (DA) techniques are an alternative to mitigate this problem. DA generates synthetic instances that can be fed to the classification algorithm during training. In this article, we explore a variety of DA methods, including back translation, paraphrasing, and text generation. We assess the impact of the DA methods in simulated low-data scenarios using well-known public datasets in English, with classifiers built by fine-tuning BERT models. We describe how to adapt these DA methods to augment a small Portuguese dataset containing tweets labeled with smart city dimensions (e.g., transportation, energy, and water). Our experiments showed that some classes were noticeably improved by DA, with an improvement of 43% in terms of F1 compared to the baseline with no augmentation. In a qualitative analysis, we observed that the DA methods were able to preserve the label but failed to preserve the semantics in some cases, and that generative models were able to produce high-quality synthetic instances.
Pages: 36
Related papers
50 in total
  • [1] Data augmentation strategies to improve text classification: a use case in smart cities
    Bencke, Luciana
    Moreira, Viviane Pereira
    [J]. LANGUAGE RESOURCES AND EVALUATION, 2024, 58 (02) : 659 - 694
  • [2] Data Augmentation with Transformers for Text Classification
    Medardo Tapia-Tellez, Jose
    Jair Escalante, Hugo
    [J]. ADVANCES IN COMPUTATIONAL INTELLIGENCE, MICAI 2020, PT II, 2020, 12469 : 247 - 259
  • [3] A Survey on Data Augmentation for Text Classification
    Bayer, Markus
    Kaufhold, Marc-Andre
    Reuter, Christian
    [J]. ACM COMPUTING SURVEYS, 2023, 55 (07)
  • [4] Hierarchical Data Augmentation and the Application in Text Classification
    Yu, Shujuan
    Yang, Jie
    Liu, Danlei
    Li, Runqi
    Zhang, Yun
    Zhao, Shengmei
    [J]. IEEE ACCESS, 2019, 7 : 185476 - 185485
  • [5] Tokenization-based data augmentation for text classification
    Prakrankamanant, Patawee
    Chuangsuwanich, Ekapol
    [J]. 2022 19TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING (JCSSE 2022), 2022,
  • [6] Probabilistic Interpolation with Mixup Data Augmentation for Text Classification
    Xu, Rongkang
    Zhang, Yongcheng
    Ren, Kai
    Huang, Yu
    Wei, Xiaomei
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT IV, ICIC 2024, 2024, 14878 : 410 - 421
  • [7] AEDA: An Easier Data Augmentation Technique for Text Classification
    Karimi, Akbar
    Rossi, Leonardo
    Prati, Andrea
    [J]. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 2748 - 2754
  • [8] Use Of Technology To Improve Bicycle Mobility In Smart Cities
    Stamatiadis, Nikiforos
    Pappalardo, Giuseppina
    Cafiso, Salvatore
    [J]. 2017 5TH IEEE INTERNATIONAL CONFERENCE ON MODELS AND TECHNOLOGIES FOR INTELLIGENT TRANSPORTATION SYSTEMS (MT-ITS), 2017, : 86 - 91
  • [9] Text Smoothing: Enhance Various Data Augmentation Methods on Text Classification Tasks
    Wu, Xing
    Gao, Chaochen
    Lin, Meng
    Zang, Liangjun
    Hu, Songlin
    [J]. PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): (SHORT PAPERS), VOL 2, 2022, : 871 - 875
  • [10] LiDA: Language-Independent Data Augmentation for Text Classification
    Sujana, Yudianto
    Kao, Hung-Yu
    [J]. IEEE ACCESS, 2023, 11 : 10894 - 10901