The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse

Cited by: 0
Authors
Kumar, Ritesh [1]
Ratan, Shyam [1]
Singh, Siddharth [1]
Nandi, Enakshi [2]
Devi, Laishram Niranjana [2]
Bhagat, Akash [3]
Dawer, Yogesh [1]
Lahiri, Bornini [3]
Bansal, Akanksha [2]
Ojha, Atul Kr. [2, 4]
Affiliations
[1] Dr Bhimrao Ambedkar Univ, Agra, India
[2] Panlingua Language Proc LLP, New Delhi, India
[3] Indian Inst Technol Kharagpur, Kharagpur, India
[4] Natl Univ Ireland Galway, DSI, Galway, Ireland
Keywords
aggression; bias; Meitei; Bangla; Hindi; tagset
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context here is defined by the conversational thread in which a specific comment occurs and by the "type" of discursive role that the comment performs with respect to the previous comment. The initial dataset discussed here consists of a total of 59,152 annotated comments in four languages (Meitei, Bangla, Hindi, and Indian English), collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper describes in detail the tagset used for annotation and the process of developing this multi-label, fine-grained tagset, which marks comments for aggression and bias of various kinds, including sexism (called gender bias in the tagset), religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We also define and discuss the tags used for marking the different discursive roles performed through the comments, such as attack, defend, etc. Finally, we present a basic statistical analysis of the dataset. The dataset is being incrementally made publicly available on the project website.
Pages: 4149-4161
Page count: 13