A Text-to-Text Model for Multilingual Offensive Language Identification

Cited: 0
Authors
Ranasinghe, Tharindu [1 ]
Zampieri, Marcos [2 ]
Affiliations
[1] Aston Univ, Birmingham, W Midlands, England
[2] George Mason Univ, Fairfax, VA USA
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
HATE SPEECH
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNet, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g., hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, based on text-to-text transformers (T5) and trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining the two datasets and of selecting an optimal confidence threshold for the semi-supervised instances in SOLID in the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state of the art on all the above datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.
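The threshold selection the abstract mentions can be illustrated with a minimal sketch: SOLID-style semi-supervised instances carry an ensemble confidence score, and only confidently offensive or confidently non-offensive instances are kept and cast as text targets for T5's text-to-text format. The field names, the `classify:` task prefix, and the 0.8/0.2 thresholds here are illustrative assumptions, not the paper's exact values.

```python
# Hypothetical sketch of confidence-threshold filtering for semi-supervised
# SOLID-style data, emitting (input, target) text pairs for T5 training.
# Thresholds and field names are assumptions for illustration.

def to_text_to_text(instances, off_threshold=0.8, not_threshold=0.2):
    """Keep confidently labelled instances as (input text, target text) pairs.

    Each instance has `text` and `avg_conf`, the mean ensemble confidence
    that the text is offensive.
    """
    pairs = []
    for inst in instances:
        if inst["avg_conf"] >= off_threshold:
            target = "offensive"
        elif inst["avg_conf"] <= not_threshold:
            target = "not offensive"
        else:
            continue  # discard instances the ensemble is unsure about
        # T5 frames every task as text-to-text: prefixed input, text label out
        pairs.append((f"classify: {inst['text']}", target))
    return pairs

data = [
    {"text": "you are an idiot", "avg_conf": 0.95},
    {"text": "nice weather today", "avg_conf": 0.05},
    {"text": "hmm maybe", "avg_conf": 0.50},
]
print(to_text_to_text(data))  # the 0.50 instance is filtered out
```

Because both the input and the label are plain strings, the same pipeline extends unchanged to finer-grained label sets, which is the flexibility the encoder-decoder architecture buys over encoder-only classifiers.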
Pages: 375-384
Page count: 10