IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context

Cited by: 0

Authors:

Sahoo, Nihar Ranjan [1]

Beria, Gyana Prakash [1]

Bhattacharyya, Pushpak [1]

Affiliations:

[1] Indian Inst Technol, CFILT, Bombay, India

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 20 | 2024

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Hate speech (HS) is a growing concern in many parts of the world, including India, where it has led to numerous instances of violence and discrimination. The development of effective counter-narratives (CNs) is a critical step in combating hate speech, but there is a lack of research in this area, especially in non-English languages. In this paper, we introduce a new dataset, IndicCONAN, of counter-narratives against hate speech in Hindi and Indian English. We propose a scalable human-in-the-loop approach for generating counter-narratives with an auto-regressive language model through a machine generation and human correction cycle, in which the model uses augmented data from previous cycles to generate new training samples. These newly generated samples are then reviewed and edited by annotators, leading to further model refinement. The dataset consists of about 2,500 counter-narrative examples each in English and Hindi, corresponding to various instances of hate speech in the Indian context. We also present a framework for generating CNs conditioned on a specific CN type, with a mean perplexity of 3.85 for English and 3.70 for Hindi, a mean toxicity score of 0.04 for English and 0.06 for Hindi, and a mean diversity of 0.08 for English and 0.14 for Hindi. Our dataset and framework provide valuable resources for researchers and practitioners working to combat hate speech in the Indian context.

Pages: 22313-22321

Page count: 9
