Comparing pre-trained language models for Spanish hate speech detection

Cited by: 66
Authors
Plaza-del-Arco, Flor Miriam [1]
Molina-Gonzalez, M. Dolores [1]
Urena-Lopez, L. Alfonso [1]
Martin-Valdivia, M. Teresa [1]
Affiliations
[1] Univ Jaen, Adv Studies Ctr Informat & Commun Technol CEATIC, Dept Comp Sci, Campus Lagunillas, E-23071 Jaen, Spain
Keywords
Hate speech; Transfer learning; BERT; BETO; Natural language processing; Text classification;
DOI
10.1016/j.eswa.2020.114120
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Nowadays, due to the great amount of uncontrolled content posted daily on the Web, there has been a huge increase in the dissemination of hate speech worldwide. Social media, blogs and community forums are examples of spaces where people can communicate freely. However, this freedom of expression is not always exercised respectfully, since offensive or insulting language is sometimes used. Social media companies often rely on users and content moderators to report this type of content. Nevertheless, given the large amount of content generated every day on the Web, automatic systems based on Natural Language Processing techniques are required to identify abusive language online. To date, most of the systems developed to combat this problem have focused on English content, but the issue is a worldwide concern that also affects other languages such as Spanish. In this paper, we address the task of Spanish hate speech identification on social media and provide a deeper understanding of the capabilities of recent machine learning techniques. In particular, we compare the performance of Deep Learning methods, recently pre-trained language models based on Transfer Learning, and traditional machine learning models. Our main contribution is the achievement of promising results in Spanish by applying multilingual and monolingual pre-trained language models such as BERT, XLM and BETO.
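The traditional machine learning baseline that the abstract compares against can be sketched with scikit-learn. This is a hedged illustration only, not the authors' actual setup: the pipeline components, the toy Spanish examples and their labels are invented for demonstration.

```python
# Minimal sketch of a traditional ML baseline for hate speech
# classification: TF-IDF features + a linear classifier.
# The toy texts and labels below are illustrative, not from the paper's corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "te odio, eres basura",     # toy hateful example
    "vete de mi pais",          # toy hateful example
    "me encanta esta cancion",  # toy non-hateful example
    "que buen dia hace hoy",    # toy non-hateful example
]
train_labels = [1, 1, 0, 0]  # 1 = hateful, 0 = not hateful

# Character n-grams are a common choice for social media text,
# since they are robust to misspellings and creative obfuscation.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(),
)
model.fit(train_texts, train_labels)

# Predict a label (0 or 1) for an unseen toy message.
pred = model.predict(["odio a esa gente"])[0]
print(pred)
```

Pre-trained models such as BERT or BETO replace the hand-chosen TF-IDF features with contextual representations learned during pre-training, which is the transfer-learning advantage the paper evaluates.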
Pages: 10
Related papers
50 in total
  • [1] Comparing Pre-Trained Language Model for Arabic Hate Speech Detection
    Daouadi, Kheir Eddine
    Boualleg, Yaakoub
    Guehairia, Oussama
COMPUTACION Y SISTEMAS, 2024, 28 (02): 681-693
  • [2] Combining multiple pre-trained models for hate speech detection in Bengali, Marathi, and Hindi
    Nandi, Arpan
    Sarkar, Kamal
    Mallick, Arjun
    De, Arkadeep
MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (32): 77733-77757
  • [3] Using Pre-Trained Language Models for Producing Counter Narratives Against Hate Speech: a Comparative Study
    Tekiroglu, Serra Sinem
    Bonaldi, Helena
    Fanton, Margherita
    Guerini, Marco
FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), 2022: 3099-3114
  • [4] Development of pre-trained language models for clinical NLP in Spanish
    Aracena, Claudio
    Dunstan, Jocelyn
17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023: 52-60
  • [5] Pre-trained Biomedical Language Models for Clinical NLP in Spanish
Carrino, Casimiro Pio
    Llop, Joan
    Pamies, Marc
    Gutierrez-Fandino, Asier
    Armengol-Estape, Jordi
    Silveira-Ocampo, Joaquin
Gonzalez-Agirre, Aitor
Valencia, Alfonso
Villegas, Marta
PROCEEDINGS OF THE 21ST WORKSHOP ON BIOMEDICAL LANGUAGE PROCESSING (BIONLP 2022), 2022: 193-199
  • [6] COVID-HateBERT: a Pre-trained Language Model for COVID-19 related Hate Speech Detection
    Li, Mingqi
    Liao, Song
    Okpala, Ebuka
    Tong, Max
    Costello, Matthew
    Cheng, Long
    Hu, Hongxin
    Luo, Feng
20TH IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2021), 2021: 233-238
  • [7] Pre-Trained Language Models and Their Applications
    Wang, Haifeng
    Li, Jiwei
    Wu, Hua
    Hovy, Eduard
    Sun, Yu
ENGINEERING, 2023, 25: 51-65
  • [8] Adapting Pre-trained Language Models to Rumor Detection on Twitter
    Slimi, Hamda
    Bounhas, Ibrahim
    Slimani, Yahya
JOURNAL OF UNIVERSAL COMPUTER SCIENCE, 2021, 27 (10): 1128-1148
  • [9] Annotating Columns with Pre-trained Language Models
    Suhara, Yoshihiko
    Li, Jinfeng
    Li, Yuliang
    Zhang, Dan
    Demiralp, Cagatay
    Chen, Chen
    Tan, Wang-Chiew
PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA (SIGMOD '22), 2022: 1493-1503
  • [10] An Empirical study on Pre-trained Embeddings and Language Models for Bot Detection
    Garcia-Silva, Andres
    Berrio, Cristian
Gomez-Perez, Jose Manuel
4TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP (REPL4NLP-2019), 2019: 148-155