A multilingual, multimodal dataset of aggression and bias: the ComMA dataset

Cited by: 1
Authors
Kumar, Ritesh [3, 4]
Ratan, Shyam [4]
Singh, Siddharth [4]
Nandi, Enakshi [2]
Devi, Laishram Niranjana [2]
Bhagat, Akash [1]
Dawer, Yogesh [4]
Lahiri, Bornini [1]
Bansal, Akanksha [2]
Affiliations
[1] Indian Inst Technol, Dept Humanities & Social Sci, Kharagpur, India
[2] Panlingua Language Proc LLP, New Delhi, India
[3] Council Strateg & Def Res, Div Artificial Intelligence & Linguist, Delhi, India
[4] UnReaL TecE LLP, Agra, India
Keywords
Aggression; Bias; Meitei; Bangla; Hindi; Offensive language; Abusive language; Discursive method; Hate speech
DOI
10.1007/s10579-023-09696-7
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment(s). The dataset has been developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1,142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages: Meitei, Bangla, Hindi, and Indian English. The data has been collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English. This paper gives a detailed description of the tagset developed during the course of this project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags that have been used for marking the different discursive roles performed through the comments, such as attack, defend, and so on. We also present a statistical analysis of the dataset as well as the results of our baseline experiments for developing an automatic aggression identification system using the dataset. Based on the results of the baseline experiments, we also argue that our dataset provides diverse and "hard" sets of instances, which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.
Pages: 757-837
Page count: 81
Related papers
  • [1] The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse
    Kumar, Ritesh
    Ratan, Shyam
    Singh, Siddharth
    Nandi, Enakshi
    Devi, Laishram Niranjana
    Bhagat, Akash
    Dawer, Yogesh
    Lahiri, Bornini
    Bansal, Akanksha
    Ojha, Atul Kr.
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 4149 - 4161
  • [2] Multilingual Image Corpus - Towards a Multimodal and Multilingual Dataset
    Koeva, Svetla
    Stoyanova, Ivelina
    Kralev, Jordan
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 1509 - 1518
  • [3] MultiSubs: A Large-scale Multimodal and Multilingual Dataset
    Wang, Josiah
    Figueiredo, Josiel
    Specia, Lucia
    [J]. LREC 2022: THIRTEEN INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6776 - 6785
  • [4] The Multimodal Dataset of Negative Affect and Aggression: A Validation Study
    Lefter, Iulia
    Fitrianie, Siska
    [J]. ICMI'18: PROCEEDINGS OF THE 20TH ACM INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, 2018, : 376 - 383
  • [5] Flood Detection in Social Media Using Multimodal Fusion on Multilingual Dataset
    Jony, Rabiul Islam
    Woodley, Alan
    Perrin, Dimitri
    [J]. 2021 INTERNATIONAL CONFERENCE ON DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA 2021), 2021, : 566 - 573
  • [6] LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization
    Nguyen, Laura
    Scialom, Thomas
    Piwowarski, Benjamin
    Staiano, Jacopo
    [J]. 17TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EACL 2023, 2023, : 636 - 651
  • [7] A Multilingual Handwritten Character Dataset: T-H-E Dataset
    Bartos, Gaye Ediboglu
    Hoscan, Yasar
    Kauer, Andras
    Hajnal, Eva
    [J]. ACTA POLYTECHNICA HUNGARICA, 2020, 17 (09) : 141 - 160
  • [8] WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
    Srinivasan, Krishna
    Raman, Karthik
    Chen, Jiecao
    Bendersky, Michael
    Najork, Marc
    [J]. SIGIR '21 - PROCEEDINGS OF THE 44TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2021, : 2443 - 2449
  • [9] A Dataset and Baselines for Multilingual Reply Suggestion
    Zhang, Mozhi
    Wang, Wei
    Deb, Budhaditya
    Zheng, Guoqing
    Shokouhi, Milad
    Awadallah, Ahmed Hassan
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 1207 - 1220
  • [10] Slovak Dataset for Multilingual Question Answering
    Hladek, Daniel
    Stas, Jan
    Juhar, Jozef
    Koctur, Tomas
    [J]. IEEE ACCESS, 2023, 11 : 32869 - 32881