The problem of varying annotations to identify abusive language in social media content

Cited by: 1
Authors
Seemann, Nina [1]
Lee, Yeong Su [1]
Hoellig, Julian [1]
Geierhos, Michaela [1]
Affiliations
[1] Univ Bundeswehr Munich, Res Inst CODE, Neubiberg, Germany
Keywords
Natural Language Processing; Abusive Language; Dataset Analysis
DOI
10.1017/S1351324923000098
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
With the increase of user-generated content on social media, the detection of abusive language has become crucial, as reflected in several shared tasks held in recent years. Automatic detection systems are desirable, and the classification of abusive social media content can be addressed with machine learning. Successful development of machine learning models depends on the availability of consistently labeled training data. However, the diversity of terms and definitions of abusive language is a major barrier. In this work, we analyze a total of nine datasets (five English and four German) designed for detecting abusive online content. We provide a detailed description of each dataset: the task for which it was created, how the data were collected, and its annotation guidelines. Our analysis shows that there is no standard definition of abusive language, which often leads to inconsistent annotations. As a consequence, it is difficult to draw cross-domain conclusions, share datasets, or reuse models for other abusive-language tasks on social media. Furthermore, our manual inspection of a random sample of each dataset revealed controversial examples. We highlight challenges in data annotation by discussing those examples, and present common problems in the annotation process, such as contradictory annotations and missing context information. Finally, to complement our theoretical work, we conduct generalization experiments on three German datasets.
Pages: 1561-1585
Page count: 25
Related papers
50 in total
  • [31] Reflection of the problem of language teaching on the internet of Tatarstan's media and social networks
    Guseinova, Aigul A.
    Zayni, Rezeda L.
    AMAZONIA INVESTIGA, 2018, 7 (15): 139-143
  • [32] Emergence of Lyme disease as a social problem: analysis of discourse using the media content
    Pascal, Clelia
    Arquembourg, Jocelyne
    Vorilhon, Philippe
    Lesens, Olivier
    EUROPEAN JOURNAL OF PUBLIC HEALTH, 2020, 30 (03): 504-510
  • [34] THE LANGUAGE FEATURES OF SOCIAL MEDIA
    Jafarov, Yedgar
    REVISTA GENERO & DIREITO, 2020, 9 (03): 954-973
  • [35] Sentiment Analysis of Social Media Content in Pashto Language using Deep Learning Algorithms
    Iqbal, Saqib
    Khan, Farhad
    Khan, Hikmat Ullah
    Iqbal, Tassawar
    Shah, Jamal Hussain
    JOURNAL OF INTERNET TECHNOLOGY, 2022, 23 (07): 1669-1677
  • [36] Generating content for social media
    Henstridge, Cat
    IN PRACTICE, 2012, 34 (06): 362-365
  • [37] VISUALIZATION OF SOCIAL MEDIA CONTENT
    Racek, Jaroslav
    Parilova, Tereza
    Toth, Dalibor
    SOFTWARE DEVELOPMENT 2012, 2012: 89-95
  • [39] Social media clinical content
    Dorri, M.
    BRITISH DENTAL JOURNAL, 2022, 233 (05): 364
  • [40] Abusive Expressions in The Use of Implicature on Social Media During Election Season
    Rofik, Nursyafieqa 'Ifwat Mohmod
    Osman, Maizura
    PERTANIKA JOURNAL OF SOCIAL SCIENCE AND HUMANITIES, 2024, 32: 139-160