A Turkish Hate Speech Dataset and Detection System

Cited by: 0
Authors
Beyhan, Fatih [1, 2]
Carik, Buse [1, 2]
Arin, Inanc [1, 2]
Terzioglu, Aysecan [3]
Yanikoglu, Berrin [1, 2]
Yeniterzi, Reyyan [1, 2]
Affiliations
[1] Sabanci Univ, Fac Engn & Nat Sci, TR-34956 Istanbul, Turkey
[2] Sabanci Univ, Ctr Excellence Data Analyt VERIM, TR-34956 Istanbul, Turkey
[3] Sabanci Univ, Fac Arts & Social Sci, TR-34956 Istanbul, Turkey
Keywords
Hate speech detection; Deep learning; Turkish
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Social media posts containing hate speech are reproduced and redistributed at an accelerated pace, reaching larger audiences more quickly. We present a machine learning system for automatic detection of hate speech in Turkish, along with a hate speech dataset consisting of tweets collected in two separate domains. We first adopted a definition of hate speech that is in line with our goals and amenable to easy annotation; we then designed the annotation schema for the collected tweets. The Istanbul Convention dataset consists of tweets posted following the withdrawal of Turkey from the Istanbul Convention. The Refugees dataset was created by collecting tweets about immigrants, filtered by commonly used keywords related to immigrants. Finally, we developed a hate speech detection system using the transformer architecture (BERTurk), to be used as a baseline for the collected dataset. When the system is evaluated using 5-fold cross-validation, the binary classification accuracy is 77% on the Istanbul Convention dataset and 71% on the Refugees dataset. We also tested a regression model, which obtains RMSE values of 0.66 and 0.83 on a scale of 0 to 4 for the Istanbul Convention and Refugees datasets, respectively.
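The evaluation protocol named in the abstract (5-fold cross-validation, accuracy for the binary task, RMSE for the 0 to 4 regression task) can be illustrated with a minimal sketch. This is not the authors' code and does not involve BERTurk: the helper names (`k_fold_indices`, `cross_validate`, `majority_baseline`) and the toy majority-class predictor are assumptions introduced purely for illustration, and the record does not say how the folds were built or whether they were stratified.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label exactly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between gold and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def cross_validate(labels, train_and_predict, metric, k=5, seed=0):
    """Average a metric over k folds, holding out one fold at a time.

    train_and_predict(train_idx, test_idx) is a hypothetical callback that
    fits a model on train_idx and returns predictions for test_idx.
    """
    scores = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        preds = train_and_predict(train_idx, held_out)
        scores.append(metric([labels[i] for i in held_out], preds))
    return sum(scores) / len(scores)


def majority_baseline(labels):
    """Toy stand-in for a real classifier: predict the majority training label."""
    def train_and_predict(train_idx, test_idx):
        train_labels = [labels[i] for i in train_idx]
        majority = max(set(train_labels), key=train_labels.count)
        return [majority] * len(test_idx)
    return train_and_predict
```

In practice the `train_and_predict` callback would fine-tune and apply the actual model on each fold; the same loop reports mean accuracy when passed `accuracy` with binary labels, and mean RMSE when passed `rmse` with graded scores.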
Pages: 4177-4185
Page count: 9