A multilingual, multimodal dataset of aggression and bias: the ComMA dataset

Times cited: 1
Authors
Kumar, Ritesh [3, 4]
Ratan, Shyam [4]
Singh, Siddharth [4]
Nandi, Enakshi [2]
Devi, Laishram Niranjana [2]
Bhagat, Akash [1]
Dawer, Yogesh [4]
Lahiri, Bornini [1]
Bansal, Akanksha [2]
Affiliations
[1] Indian Inst Technol, Dept Humanities & Social Sci, Kharagpur, India
[2] Panlingua Language Proc LLP, New Delhi, India
[3] Council Strateg & Def Res, Div Artificial Intelligence & Linguist, Delhi, India
[4] UnReaL TecE LLP, Agra, India
Keywords
Aggression; Bias; Meitei; Bangla; Hindi; Offensive language; Abusive language; Discursive method; Hate speech
DOI
10.1007/s10579-023-09696-7
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and by the "type" of discursive role that the comment performs with respect to the previous comment(s). The dataset has been developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1,142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages: Meitei, Bangla, Hindi, and Indian English. The data has been collected from social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English. This paper gives a detailed description of the tagset developed during the course of this project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags used for marking the different discursive roles performed through the comments, such as attack, defend, and so on. We also present a statistical analysis of the dataset as well as the results of our baseline experiments for developing an automatic aggression identification system using the dataset. Based on the results of the baseline experiments, we also argue that our dataset provides diverse and 'hard' sets of instances, which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.
Pages: 757-837
Number of pages: 81