The ComMA Dataset V0.2: Annotating Aggression and Bias in Multilingual Social Media Discourse

Cited by: 0

Authors:

Kumar, Ritesh ^{[1]}

Ratan, Shyam ^{[1]}

Singh, Siddharth ^{[1]}

Nandi, Enakshi ^{[2]}

Devi, Laishram Niranjana ^{[2]}

Bhagat, Akash ^{[3]}

Dawer, Yogesh ^{[1]}

Lahiri, Bornini ^{[3]}

Bansal, Akanksha ^{[2]}

Ojha, Atul Kr. ^{[2,4]}

Affiliations:

[1] Dr Bhimrao Ambedkar Univ, Agra, India

[2] Panlingua Language Proc LLP, New Delhi, India

[3] Indian Inst Technol Kharagpur, Kharagpur, India

[4] Natl Univ Ireland Galway, DSI, Galway, Ireland

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Keywords:

aggression; bias; Meitei; Bangla; Hindi; Tagset;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

In this paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the "context" in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also the "type" of discursive role that the comment is performing with respect to the previous comment. The initial dataset discussed here consists of a total of 59,152 annotated comments in four languages (Meitei, Bangla, Hindi, and Indian English), collected from various social media platforms such as YouTube, Facebook, Twitter and Telegram. As is usual on social media websites, a large number of these comments are multilingual, mostly code-mixed with English. The paper gives a detailed description of the tagset used for annotation and of the process of developing a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including sexism (called gender bias in the tagset), religious intolerance (called communal bias in the tagset), class/caste bias and ethnic/racial bias. We also define and discuss the tags used for marking the different discursive roles performed through the comments, such as attack, defend, etc. Finally, we present a basic statistical analysis of the dataset. The dataset is being incrementally made publicly available on the project website.

Pages: 4149-4161

Number of pages: 13

