A Text-to-Text Model for Multilingual Offensive Language Identification

Cited by: 0
Authors
Ranasinghe, Tharindu [1 ]
Zampieri, Marcos [2 ]
Affiliations
[1] Aston Univ, Birmingham, W Midlands, England
[2] George Mason Univ, Fairfax, VA USA
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
HATE SPEECH;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNet, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g. hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, using text-to-text transformers (T5) trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining the two datasets and of selecting an optimal threshold for the semi-supervised instances in SOLID during the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state of the art on all of the above datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.
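The abstract mentions two ideas that a small example may make more concrete: casting classification as a text-to-text task, and applying a threshold to semi-supervised instances. The following is a minimal sketch, not the authors' implementation. The field names (`avg_conf`, `std`), the task prefix, the label strings, and the threshold values are all hypothetical and chosen only for illustration; the record does not state how the paper defines or selects them.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One semi-supervised example (hypothetical schema)."""
    text: str
    avg_conf: float  # assumed: aggregated confidence that the text is offensive
    std: float       # assumed: disagreement among the models that scored it


def select_confident(instances, max_std):
    """Keep only instances whose scorers roughly agree (std at or below max_std)."""
    return [inst for inst in instances if inst.std <= max_std]


def to_text_pair(inst, threshold=0.5, prefix="classify offensive: "):
    """Turn an instance into an (input text, target text) pair.

    In a text-to-text setup the label is emitted as a string, so the
    label set is not tied to a fixed-size classification head.
    """
    target = "OFF" if inst.avg_conf >= threshold else "NOT"
    return prefix + inst.text, target


if __name__ == "__main__":
    data = [
        Instance("example post one", avg_conf=0.91, std=0.05),
        Instance("example post two", avg_conf=0.48, std=0.30),
        Instance("example post three", avg_conf=0.12, std=0.04),
    ]
    for inst in select_confident(data, max_std=0.1):
        print(to_text_pair(inst))
```

The resulting pairs could then be fed to any encoder-decoder training loop; which threshold works best is an empirical question that the abstract says the paper investigates.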
Pages: 375-384
Number of pages: 10