Data augmentation strategies to improve text classification: a use case in smart cities

Cited by: 0
Authors
Bencke, Luciana [1 ]
Moreira, Viviane Pereira [1 ]
Affiliation
[1] Federal University of Rio Grande do Sul (UFRGS), Institute of Informatics, Porto Alegre, RS, Brazil
Keywords
Data augmentation; Text classification; Low-resources; Smart cities; Twitter
DOI
10.1007/s10579-023-09685-w
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Text classification is a common and important task in Natural Language Processing. In many domains and real-world settings, only a few labeled instances are available to train classifiers. Models trained on small datasets tend to overfit and produce inaccurate results. Data augmentation (DA) techniques are an alternative for mitigating this problem: DA generates synthetic instances that can be fed to the classification algorithm during training. In this article, we explore a variety of DA methods, including back translation, paraphrasing, and text generation. We assess the impact of the DA methods in simulated low-data scenarios, using well-known public datasets in English and classifiers built by fine-tuning BERT models. We describe how to adapt these DA methods to augment a small Portuguese dataset of tweets labeled with smart city dimensions (e.g., transportation, energy, and water). Our experiments showed that some classes were noticeably improved by DA, with an improvement of 43% in terms of F1 compared to the baseline with no augmentation. In a qualitative analysis, we observed that the DA methods were able to preserve the label but failed to preserve the semantics in some cases, and that generative models were able to produce high-quality synthetic instances.
Pages: 659-694
Number of pages: 36
Related papers
50 in total
  • [31] Explainable Text Classification via Attentive and Targeted Mixing Data Augmentation
    Jiang, Songhao
    Chu, Yan
    Wang, Zhengkui
    Ma, Tianxing
    Wang, Hanlin
    Lu, Wenxuan
    Zang, Tianning
    Wang, Bo
    [J]. PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 5085 - 5094
  • [32] Heavy-tailed Representations, Text Polarity Classification & Data Augmentation
    Jalalzai, Hamid
    Colombo, Pierre
    Clavel, Chloe
    Gaussier, Eric
    Varni, Giovanna
    Vignon, Emmanuel
    Sabourin, Anne
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [33] Smart Cities and Big Data Analytics: A Data-Driven Decision-Making Use Case
    Osman, Ahmed M. Shahat
    Elragal, Ahmed
    [J]. SMART CITIES, 2021, 4 (01) : 286 - 313
  • [34] Data governance for smart cities in China: the case of Shenzhen
    Xie, Siqi
    Luo, Ning
    Yarime, Masaru
    [J]. POLICY DESIGN AND PRACTICE, 2024, 7 (01) : 66 - 86
  • [35] A Data Integration Approach for Smart Cities: The Case of Natal
    Souza, Arthur
    Pereira, Jorge
    Oliveira, Juliana
    Trindade, Claudio
    Cavalcante, Everton
    Cacho, Nelio
    Batista, Thais
    Lopes, Frederico
    [J]. 2017 INTERNATIONAL SMART CITIES CONFERENCE (ISC2), 2017
  • [36] Smart Cities as Hubs: a use case from Biotechnology
    Tsapadikou, Asteria
    Anthopoulos, Leonidas
    [J]. COMPANION OF THE WORLD WIDE WEB CONFERENCE, WWW 2023, 2023, : 702 - 705
  • [37] Digital Twin Use Case for Smart, Sustainable Cities
    Cardoso, Joana L. F. P.
    Rhodes, Donna H.
    [J]. PROCEEDINGS OF THE 2023 CONFERENCE ON SYSTEMS ENGINEERING RESEARCH, CSER 2023, 2024, : 99 - 115
  • [38] Producing Linked Data for Smart Cities: The Case of Catania
    Consoli, Sergio
    Presutti, Valentina
    Recupero, Diego Reforgiato
    Nuzzolese, Andrea G.
    Peroni, Silvio
    Mongiovi, Misael
    Gangemi, Aldo
    [J]. BIG DATA RESEARCH, 2017, 7 : 1 - 15
  • [39] The value of Big Data in government: The case of 'smart cities'
    Lofgren, Karl
    Webster, C. William R.
    [J]. BIG DATA & SOCIETY, 2020, 7 (01)
  • [40] EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
    Wei, Jason
    Zou, Kai
    [J]. 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 6382 - 6388