A Text-to-Text Model for Multilingual Offensive Language Identification

Cited by: 0
Authors
Ranasinghe, Tharindu [1]
Zampieri, Marcos [2]
Affiliations
[1] Aston Univ, Birmingham, W Midlands, England
[2] George Mason Univ, Fairfax, VA USA
Funding
UK Engineering and Physical Sciences Research Council (EPSRC)
Keywords
HATE SPEECH
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The ubiquity of offensive content on social media is a growing cause for concern among companies and government organizations. Recently, transformer-based models such as BERT, XLNet, and XLM-R have achieved state-of-the-art performance in detecting various forms of offensive content (e.g., hate speech, cyberbullying, and cyberaggression). However, the majority of these models are limited in their capabilities due to their encoder-only architecture, which restricts the number and types of labels in downstream tasks. Addressing these limitations, this study presents the first pre-trained model with an encoder-decoder architecture for offensive language identification, based on text-to-text transformers (T5) and trained on two large offensive language identification datasets: SOLID and CCTK. We investigate the effectiveness of combining the two datasets and of selecting an optimal threshold for the semi-supervised instances in SOLID during the T5 retraining step. Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, on multiple English benchmarks. Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5 and evaluate its performance on a set of six different languages (German, Hindi, Korean, Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual model achieves a new state of the art on all of these datasets, showing its usefulness in multilingual scenarios. Our proposed T5-based models will be made freely available to the community.
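The abstract mentions two ideas that a short example may make more concrete: filtering semi-supervised instances by a threshold, and casting classification as text-to-text. The following is a minimal sketch only, not the authors' procedure. The field names (`avg_score`, `std`), the threshold values, the `classify:` prefix, and the label strings are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
from typing import List, NamedTuple, Tuple


class Instance(NamedTuple):
    text: str
    avg_score: float  # hypothetical: mean offensiveness score from weak labellers
    std: float        # hypothetical: disagreement between those labellers


def to_text_pairs(
    instances: List[Instance],
    threshold: float = 0.5,   # assumed decision threshold, not from the paper
    max_std: float = 0.15,    # assumed cap on labeller disagreement
) -> List[Tuple[str, str]]:
    """Drop uncertain instances and return (input text, target text) pairs."""
    pairs = []
    for inst in instances:
        if inst.std > max_std:
            continue  # labellers disagree too much, so skip this instance
        label = "offensive" if inst.avg_score >= threshold else "not offensive"
        # Text-to-text framing: the target is a label string, not a class index.
        pairs.append((f"classify: {inst.text}", label))
    return pairs


examples = [
    Instance("first post", 0.90, 0.05),
    Instance("second post", 0.60, 0.30),  # filtered out: std above max_std
    Instance("third post", 0.10, 0.02),
]
pairs = to_text_pairs(examples)
```

Because the target is plain text, the label set can be changed or extended without altering the model's output layer, which is the flexibility the abstract attributes to the encoder-decoder design.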
Pages: 375-384
Number of pages: 10