IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context

Authors
Sahoo, Nihar Ranjan [1]
Beria, Gyana Prakash [1]
Bhattacharyya, Pushpak [1]
Affiliations
[1] Indian Inst Technol, CFILT, Bombay, India
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Hate speech (HS) is a growing concern in many parts of the world, including India, where it has led to numerous instances of violence and discrimination. Developing effective counter-narratives (CNs) is a critical step in combating hate speech, but research in this area is scarce, especially for non-English languages. In this paper, we introduce IndicCONAN, a new dataset of counter-narratives against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach in which an auto-regressive language model generates counter-narratives through a machine-generation, human-correction cycle: the model uses augmented data from previous cycles to generate new training samples, and annotators then review and edit those samples, leading to further model refinement. The dataset consists of over ~2,500 counter-narrative examples in each of English and Hindi, corresponding to various hate speeches in the Indian context. We also present a framework for generating CNs conditioned on a specific CN type, which achieves a mean perplexity of 3.85 for English and 3.70 for Hindi, a mean toxicity score of 0.04 for English and 0.06 for Hindi, and a mean diversity of 0.08 for English and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to combat hate speech in the Indian context.
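The machine-generation, human-correction cycle described in the abstract can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function name `run_hitl_cycles` and the `train`, `generate`, and `review` callables are hypothetical placeholders for model fine-tuning, counter-narrative generation, and annotator review, none of which the abstract specifies in detail.

```python
from typing import Any, Callable, List, Optional, Tuple

# A training sample: (hate speech text, counter-narrative text).
Pair = Tuple[str, str]


def run_hitl_cycles(
    seed_pairs: List[Pair],
    hate_speech_pool: List[str],
    train: Callable[[List[Pair]], Any],
    generate: Callable[[Any, str], str],
    review: Callable[[str, str], Optional[Pair]],
    n_cycles: int,
) -> List[Pair]:
    """Grow a counter-narrative dataset over several generate/correct cycles.

    All callables are hypothetical stand-ins (assumptions, not a real API):
      train(data)         -> a model fitted on the data collected so far
      generate(model, hs) -> a machine-written counter-narrative for `hs`
      review(hs, cn)      -> the annotator-edited pair, or None if rejected
    """
    data = list(seed_pairs)
    for _ in range(n_cycles):
        # Retrain on everything accepted in earlier cycles.
        model = train(data)
        # Machine generation step: one candidate per hate speech input.
        candidates = [(hs, generate(model, hs)) for hs in hate_speech_pool]
        # Human correction step: keep only reviewed (possibly edited) pairs.
        for hs, cn in candidates:
            corrected = review(hs, cn)
            if corrected is not None:
                data.append(corrected)
    return data
```

In this sketch the data accepted in one cycle is folded back into training for the next, which is the feedback property the abstract attributes to the approach; how samples are selected, edited, or filtered in practice is left to the `review` placeholder.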
Pages: 22313-22321
Page count: 9