IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context

Cited by: 0
Authors
Sahoo, Nihar Ranjan [1]
Beria, Gyana Prakash [1]
Bhattacharyya, Pushpak [1]
Affiliation
[1] Indian Inst Technol, CFILT, Bombay, India
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Hate speech (HS) is a growing concern in many parts of the world, including India, where it has led to numerous instances of violence and discrimination. The development of effective counter-narratives (CNs) is a critical step in combating hate speech, but there is a lack of research in this area, especially in non-English languages. In this paper, we introduce a new dataset, IndicCONAN, of counter-narratives against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach for generating counter-narratives with an auto-regressive language model through a machine-generation and human-correction cycle, in which the model uses augmented data from previous cycles to generate new training samples. These newly generated samples are then reviewed and edited by annotators, leading to further model refinement. The dataset consists of over ~2,500 examples of counter-narratives in each of English and Hindi, corresponding to various hate speeches in the Indian context. We also present a framework for generating CNs conditioned on a specific CN type, with a mean perplexity of 3.85 for English and 3.70 for Hindi, a mean toxicity score of 0.04 for English and 0.06 for Hindi, and a mean diversity of 0.08 for English and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to combat hate speech in the Indian context.
Pages: 22313 - 22321
Number of pages: 9
Related papers
50 in total
  • [1] Navigating the Virtual Realm of Hate: Analysis of Policies Combating Online Hate Speech in the Italian-European Context
    Battista, Daniele
    Uva, Gabriele
    LAW TECHNOLOGY AND HUMANS, 2024, 6 (01) : 48 - 58
  • [2] CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech
    Chung, Yi-Ling
    Kuzmenko, Elizaveta
    Tekiroglu, Serra Sinem
    Guerini, Marco
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019 : 2819 - 2829
  • [3] Multilingual and Multimodal Hate Speech Analysis in Twitter
    Liz De la Pena Sarracen, Gretel
    WSDM '21: PROCEEDINGS OF THE 14TH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2021 : 1109 - 1110
  • [4] A Deep Dive into Multilingual Hate Speech Classification
    Aluru, Sai Saketh
    Mathew, Binny
    Saha, Punyajoy
    Mukherjee, Animesh
    MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: APPLIED DATA SCIENCE AND DEMO TRACK, ECML PKDD 2020, PT V, 2021, 12461 : 423 - 439
  • [5] A Multilingual Evaluation for Online Hate Speech Detection
    Corazza, Michele
    Menini, Stefano
    Cabrio, Elena
    Tonelli, Sara
    Villata, Serena
    ACM TRANSACTIONS ON INTERNET TECHNOLOGY, 2020, 20 (02)
  • [6] CMU WILDERNESS MULTILINGUAL SPEECH DATASET
    Black, Alan W.
    2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2019 : 5971 - 5975
  • [7] A Dialogic Approach to Combating Hate Speech on College Campuses
    Hatfield, Katherine L.
    Schafer, Kellie
    Stroup, Kristopher A.
    ATLANTIC JOURNAL OF COMMUNICATION, 2005, 13 (01) : 41 - 55
  • [8] Multilingual Hate Speech Detection: Innovations in Optimized Deep Learning for English and Arabic Hate Speech Detection
    Hassan AL-Sukhani
    Qusay Bsoul
    Abdelrahman H. Elhawary
    Ziad M. Nasr
    Ahmed E. Mansour
    Radwan M. Batyha
    Basma S. Alqadi
    Jehad Saad Alqurni
    Hayat Alfagham
    Magda M. Madbouly
    SN Computer Science, 6 (3)
  • [9] Multilingual and Multi-Aspect Hate Speech Analysis
    Ousidhoum, Nedjma
    Lin, Zizheng
    Zhang, Hongming
    Song, Yangqiu
    Yeung, Dit-Yan
    2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019 : 4675 - 4684
  • [10] A Turkish Hate Speech Dataset and Detection System
    Beyhan, Fatih
    Carik, Buse
    Arin, Inanc
    Terzioglu, Aysecan
    Yanikoglu, Berrin
    Yeniterzi, Reyyan
    LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022 : 4177 - 4185