IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context

Cited by: 0
Authors
Sahoo, Nihar Ranjan [1]
Beria, Gyana Prakash [1]
Bhattacharyya, Pushpak [1]
Affiliations
[1] Indian Inst Technol, CFILT, Bombay, India
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Hate speech (HS) is a growing concern in many parts of the world, including India, where it has led to numerous instances of violence and discrimination. The development of effective counter-narratives (CNs) is a critical step in combating hate speech, but there is a lack of research in this area, especially in non-English languages. In this paper, we introduce a new dataset, IndicCONAN, of counter-narratives against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach for generating counter-narratives with an auto-regressive language model through a machine-generation, human-correction cycle, in which the model uses augmented data from previous cycles to generate new training samples. These newly generated samples are then reviewed and edited by annotators, leading to further model refinement. The dataset consists of roughly 2,500 examples of counter-narratives in each of English and Hindi, corresponding to various instances of hate speech in the Indian context. We also present a framework for generating CNs conditioned on a specific CN type, with a mean perplexity of 3.85 for English and 3.70 for Hindi, a mean toxicity score of 0.04 for English and 0.06 for Hindi, and a mean diversity of 0.08 for English and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to combat hate speech in the Indian context.
Pages: 22313-22321
Page count: 9