Label modification and bootstrapping for zero-shot cross-lingual hate speech detection

Cited by: 7
Authors
Bigoulaeva, Irina [1]
Hangya, Viktor [2]
Gurevych, Iryna [1]
Fraser, Alexander [2]
Affiliations
[1] Tech Univ Darmstadt, Dept Comp Sci, Ubiquitous Knowledge Proc Lab, UKP Lab, Darmstadt, Germany
[2] Ludwig Maximilians Univ Munchen, Ctr Informat & Language Proc, Munich, Germany
Keywords
Hate speech; Cross-lingual transfer learning; Class imbalance; BERT; CNN; LSTM
DOI
10.1007/s10579-023-09637-4
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
The goal of hate speech detection is to filter negative online content that targets certain groups of people. Because social media platforms are easily accessible and multilingual, it is crucial to protect everyone, which requires building hate speech detection systems for a wide range of languages. However, the available labeled hate speech datasets are limited, making it difficult to build systems for many languages. In this paper we focus on cross-lingual transfer learning to support hate speech detection in low-resource languages, while highlighting label issues across application scenarios, such as inconsistent label sets of corpora or differing hate speech definitions, which hinder the application of such methods. We leverage cross-lingual word embeddings to train our neural network systems on the source language and apply them to the target language, which lacks labeled examples, and show that good performance can be achieved. We then incorporate unlabeled target language data for further model improvements by bootstrapping labels using an ensemble of different model architectures. Furthermore, we investigate the issue of label imbalance in hate speech datasets, since the high ratio of non-hate examples compared to hate examples often leads to low model performance. We test simple data undersampling and oversampling techniques and show their effectiveness.
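The abstract names two generic techniques, random over/undersampling and bootstrapping labels with an ensemble. The following is a minimal Python sketch of both ideas, not the authors' implementation: the function names (oversample, undersample, bootstrap_labels), the agreement threshold, and the representation of a model as a plain callable from text to label are all assumptions made for illustration.

```python
import random
from collections import Counter


def _group_by_label(examples, labels):
    """Collect examples into a dict keyed by their label."""
    groups = {}
    for x, y in zip(examples, labels):
        groups.setdefault(y, []).append(x)
    return groups


def oversample(examples, labels, seed=0):
    """Randomly duplicate examples until every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(examples, labels)
    target = max(len(xs) for xs in groups.values())
    out = []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return out


def undersample(examples, labels, seed=0):
    """Randomly drop examples until every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(examples, labels)
    target = min(len(xs) for xs in groups.values())
    out = []
    for y, xs in groups.items():
        out.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(out)
    return out


def bootstrap_labels(unlabeled, models, min_agreement=None):
    """Pseudo-label unlabeled texts with an ensemble (hypothetical scheme).

    Each model is assumed to be a callable mapping a text to a label.
    A text is kept only if at least `min_agreement` models predict the
    same label; by default all models must agree.
    """
    if min_agreement is None:
        min_agreement = len(models)
    kept = []
    for text in unlabeled:
        votes = Counter(model(text) for model in models)
        label, count = votes.most_common(1)[0]
        if count >= min_agreement:
            kept.append((text, label))
    return kept
```

In such a setup, the pseudo-labeled pairs returned by bootstrap_labels would be added to the training data for another training round, and the agreement threshold trades the amount of added data against label noise.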
Pages: 1515-1546
Page count: 32