The problem of varying annotations to identify abusive language in social media content

Cited by: 1
Authors
Seemann, Nina [1]
Lee, Yeong Su [1]
Hoellig, Julian [1]
Geierhos, Michaela [1]
Affiliations
[1] Univ Bundeswehr Munich, Res Inst CODE, Neubiberg, Germany
Keywords
Natural Language Processing; Abusive Language; Dataset Analysis
DOI
10.1017/S1351324923000098
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the increase of user-generated content on social media, the detection of abusive language has become crucial and is therefore reflected in several shared tasks that have been performed in recent years. The development of automatic detection systems is desirable, and the classification of abusive social media content can be solved with the help of machine learning. The basis for successful development of machine learning models is the availability of consistently labeled training data. But a diversity of terms and definitions of abusive language is a crucial barrier. In this work, we analyze a total of nine datasets (five English and four German) designed for detecting abusive online content. We provide a detailed description of the datasets, that is, for which tasks each dataset was created, how its data were collected, and what its annotation guidelines are. Our analysis shows that there is no standard definition of abusive language, which often leads to inconsistent annotations. As a consequence, it is difficult to draw cross-domain conclusions, share datasets, or use models for other abusive social media language tasks. Furthermore, our manual inspection of a random sample of each dataset revealed controversial examples. We highlight challenges in data annotation by discussing those examples, and present common problems in the annotation process, such as contradictory annotations and missing context information. Finally, to complement our theoretical work, we conduct generalization experiments on three German datasets.
Pages: 1561-1585
Page count: 25