A Text-to-Text Model for Multilingual Offensive Language Identification

Cited by: 0
Authors
Ranasinghe, Tharindu [1 ]
Zampieri, Marcos [2 ]
Affiliations
[1] Aston Univ, Birmingham, W Midlands, England
[2] George Mason Univ, Fairfax, VA USA
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
HATE SPEECH;
DOI
None available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNet, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g., hate speech, cyberbullying, and cyberaggression). However, most of these models are limited by their encoder-only architecture, which restricts the number and types of labels available in downstream tasks. Addressing these limitations, this study presents the first pre-trained encoder-decoder model for offensive language identification, built on text-to-text transformers (T5) and trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining the two datasets and of selecting an optimal confidence threshold for SOLID's semi-supervised instances during the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on six languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state of the art on all of these datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.
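The abstract's key point is that an encoder-decoder model casts classification as text generation, so the label space is not baked into the architecture. Below is a minimal sketch of that framing using the Hugging Face Transformers library; it is not the authors' released code. The checkpoint name "t5-base", the "classify:" task prefix, and the label strings "offensive" / "not offensive" are illustrative assumptions standing in for the paper's SOLID/CCTK-trained models.

# Minimal sketch (assumptions noted above), Python with Hugging Face Transformers.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # stand-in checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def classify(text: str) -> str:
    """Generate a label string instead of predicting a class index."""
    inputs = tokenizer(f"classify: {text}", return_tensors="pt",
                       truncation=True, max_length=256)
    output_ids = model.generate(**inputs, max_new_tokens=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# One supervised fine-tuning step: the target is the label text itself,
# so new or finer-grained labels never require changing the architecture.
batch = tokenizer(["classify: example post"], return_tensors="pt")
labels = tokenizer(["not offensive"], return_tensors="pt").input_ids
loss = model(input_ids=batch.input_ids,
             attention_mask=batch.attention_mask,
             labels=labels).loss
loss.backward()

This generative framing is what sidesteps the fixed label-space constraint the abstract attributes to encoder-only models: adding a label means adding a target string, not a new classification head.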
Pages: 375-384 (10 pages)
Related Papers (10 of 50 shown)
  • [1] HaT5: Hate Language Identification using Text-to-Text Transfer Transformer
    Sabry, Sana Sabah
    Adewumi, Tosin
    Abid, Nosheen
    Kovacs, Gyorgy
    Liwicki, Foteini
    Liwicki, Marcus
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022
  • [2] mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs
    Chi, Zewen
    Dong, Li
    Ma, Shuming
    Huang, Shaohan
    Mao, Xian-Ling
    Huang, Heyan
    Wei, Furu
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021: 1671-1683
  • [3] mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences
    Uthus, David
    Ontanon, Santiago
    Ainslie, Joshua
    Guo, Mandy
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023: 9380-9386
  • [4] Text-to-text generative approach for enhanced complex word identification
    Sliwiak, Patrycja
    Shah, Syed Afaq Ali
    NEUROCOMPUTING, 2024, 610
  • [5] Evaluation of Transfer Learning for Polish with a Text-to-Text Model
    Chrabrowa, Aleksandra
    Dragan, Lukasz
    Grzegorczyk, Karol
    Kajtoch, Dariusz
    Koszowski, Mikolaj
    Mroczkowski, Robert
    Rybak, Piotr
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022: 4374-4394
  • [6] Text-based Language Identification of Multilingual Names
    Giwa, Oluwapelumi
    Davel, Marelie H.
    PROCEEDINGS OF THE 2015 PATTERN RECOGNITION ASSOCIATION OF SOUTH AFRICA AND ROBOTICS AND MECHATRONICS INTERNATIONAL CONFERENCE (PRASA-ROBMECH), 2015: 166-171
  • [7] mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
    Xue, Linting
    Constant, Noah
    Roberts, Adam
    Kale, Mihir
    Al-Rfou, Rami
    Siddhant, Aditya
    Barua, Aditya
    Raffel, Colin
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021: 483-498
  • [8] Leveraging Text-to-Text Pretrained Language Models for Question Answering in Chemistry
    Tran, Dan
    Pascazio, Laura
    Akroyd, Jethro
    Mosbach, Sebastian
    Kraft, Markus
    ACS OMEGA, 2024, 9 (12): 13883-13896
  • [9] AraT5: Text-to-Text Transformers for Arabic Language Generation
    Nagoudi, El Moatez Billah
    Elmadany, AbdelRahim
    Abdul-Mageed, Muhammad
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1 (LONG PAPERS), 2022: 628-647
  • [10] ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation
    Phan, Long
    Tran, Hieu
    Nguyen, Hieu
    Trinh, Trieu H.
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES: PROCEEDINGS OF THE STUDENT RESEARCH WORKSHOP, 2022: 136-142