Data augmentation strategies to improve text classification: a use case in smart cities

Cited by: 0

Authors:

Bencke, Luciana [1]

Moreira, Viviane Pereira [1]

Affiliations:

[1] Fed Univ Rio Grande Sul UFRGS, Inst Informat, Porto Alegre, RS, Brazil

Source:

LANGUAGE RESOURCES AND EVALUATION | 2023

Keywords:

Data augmentation; Text classification; Low-resources; Smart cities;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Text classification is a common and important task in Natural Language Processing. In many domains and real-world settings, only a few labeled instances are available to train classifiers. Models trained on small datasets tend to overfit and produce inaccurate results. Data augmentation (DA) techniques offer a way to mitigate this problem: DA generates synthetic instances that can be fed to the classification algorithm during training. In this article, we explore a variety of DA methods, including back translation, paraphrasing, and text generation. We assess the impact of the DA methods in simulated low-data scenarios using well-known public datasets in English, with classifiers built by fine-tuning BERT models. We describe how to adapt these DA methods to augment a small Portuguese dataset containing tweets labeled with smart city dimensions (e.g., transportation, energy, and water). Our experiments showed that some classes were noticeably improved by DA, with an improvement of 43% in terms of F1 compared to the baseline with no augmentation. In a qualitative analysis, we observed that the DA methods were able to preserve the label but failed to preserve the semantics in some cases, and that generative models were able to produce high-quality synthetic instances.

Pages: 36

[1] Data augmentation strategies to improve text classification: a use case in smart cities
Bencke, Luciana
Moreira, Viviane Pereira
[J]. LANGUAGE RESOURCES AND EVALUATION, 2024, 58 (02) : 659 - 694
