The problem of varying annotations to identify abusive language in social media content

Cited by: 1
Authors
Seemann, Nina [1 ]
Lee, Yeong Su [1 ]
Hoellig, Julian [1 ]
Geierhos, Michaela [1 ]
Affiliations
[1] Univ Bundeswehr Munich, Res Inst CODE, Neubiberg, Germany
Keywords
Natural Language Processing; Abusive Language; Dataset Analysis
DOI
10.1017/S1351324923000098
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the increase of user-generated content on social media, the detection of abusive language has become crucial and is therefore reflected in several shared tasks that have been performed in recent years. The development of automatic detection systems is desirable, and the classification of abusive social media content can be solved with the help of machine learning. The basis for successful development of machine learning models is the availability of consistently labeled training data. But a diversity of terms and definitions of abusive language is a crucial barrier. In this work, we analyze a total of nine datasets, five English and four German, designed for detecting abusive online content. We provide a detailed description of the datasets, that is, for which tasks the dataset was created, how the data were collected, and its annotation guidelines. Our analysis shows that there is no standard definition of abusive language, which often leads to inconsistent annotations. As a consequence, it is difficult to draw cross-domain conclusions, share datasets, or use models for other abusive social media language tasks. Furthermore, our manual inspection of a random sample of each dataset revealed controversial examples. We highlight challenges in data annotation by discussing those examples, and present common problems in the annotation process, such as contradictory annotations and missing context information. Finally, to complement our theoretical work, we conduct generalization experiments on three German datasets.
Pages: 1561-1585
Page count: 25
Related papers
50 in total
  • [1] Classification of Abusive Thai Language Content in Social Media Using Deep Learning
    Wanasukapunt, Ruangsung
    Phimoltares, Suphakant
    2021 18TH INTERNATIONAL JOINT CONFERENCE ON COMPUTER SCIENCE AND SOFTWARE ENGINEERING (JCSSE-2021), 2021,
  • [2] A Novel Hybrid Model of Word Embedding and Deep Learning to Identify Hate and Abusive Content on Social Media Platform
    Kumar, Sachin
    Bhagat, Ankit Kumar
    Erugurala, Akash
    Mirza, Amna
    Jha, Alok Nikhil
    Verma, Ajit Kumar
    FRONTIERS OF ARTIFICIAL INTELLIGENCE, ETHICS, AND MULTIDISCIPLINARY APPLICATIONS, FAIEMA 2023, 2024, : 39 - 50
  • [3] Abusive Language on Social Media Through the Legal Looking Glass
    Bertaglia, Thales
    Grigoriu, Andreea
    Dumontier, Michel
    van Dijck, Gijs
    WOAH 2021: THE 5TH WORKSHOP ON ONLINE ABUSE AND HARMS, 2021, : 191 - 200
  • [4] User-aware multilingual abusive content detection in social media
    Rehman, Mohammad Zia Ur
    Mehta, Somya
    Singh, Kuldeep
    Kaushik, Kunal
    Kumar, Nagendra
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (05)
  • [5] Harnessing the Power of Text Mining for the Detection of Abusive Content in Social Media
    Chen, Hao
    Mckeever, Susan
    Delany, Sarah Jane
    ADVANCES IN COMPUTATIONAL INTELLIGENCE SYSTEMS, 2017, 513 : 187 - 205
  • [6] Automatic Detection of Cyberbullying and Abusive Language in Arabic Content on Social Networks: A Survey
    Khairy, Marwa
    Mahmoud, Tarek M.
    Abd-El-Hafeez, Tarek
    AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 156 - 166
  • [7] Hate speech and abusive language detection in Indonesian social media: Progress and challenges
    Ibrohim, Muhammad Okky
    Budi, Indra
    HELIYON, 2023, 9 (08)
  • [8] Hidden behind the obvious: Misleading keywords and implicitly abusive language on social media
    Yin, Wenjie
    Zubiaga, Arkaitz
    ONLINE SOCIAL NETWORKS AND MEDIA, 2022, 30
  • [9] Abusive Language Detection in Online User Content
    Nobata, Chikashi
    Tetreault, Joel
    Thomas, Achint
    Mehdad, Yashar
    Chang, Yi
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16), 2016, : 145 - 153
  • [10] Detection of Hateful Social Media Content for Arabic Language
    Al-Ibrahim, Rogayah M.
    Ali, Mostafa Z.
    Najadat, Hassan M.
    ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2023, 22 (09)