A Hierarchically-Labeled Portuguese Hate Speech Dataset

Cited by: 0

Authors:

Fortuna, Paula [1,3]

Rocha da Silva, Joao [1,2]

Soler-Company, Juan [3]

Wanner, Leo [3,4]

Nunes, Sergio [1,2]

Affiliations:

[1] Univ Porto, INESC TEC, Rua Dr Roberto Frias S-N, P-4200465 Porto, Portugal

[2] Univ Porto, FEUP, Rua Dr Roberto Frias S-N, P-4200465 Porto, Portugal

[3] Pompeu Fabra Univ, ETIC, NLP Grp, Barcelona, Spain

[4] Catalan Inst Res & Adv Studies ICREA, Barcelona, Spain

Source:

THIRD WORKSHOP ON ABUSIVE LANGUAGE ONLINE | 2019

Funding:

EU Horizon 2020

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Over the past years, the amount of online offensive speech has been growing steadily. Machine learning is widely applied to cope with it, but ML-based techniques require sufficiently large annotated datasets. In recent years, several datasets have been published, mainly for English. In this paper, we present a new dataset for Portuguese, which has not been a focus so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes used by annotators with different levels of expertise. First, non-experts annotated the tweets with binary labels ('hate' vs. 'no-hate'). Then, expert annotators classified the tweets following a fine-grained hierarchical multi-label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. The hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried out a baseline classification experiment with pre-trained word embeddings and an LSTM on the binary classified data, with a state-of-the-art outcome.
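
The idea of a hierarchical multi-label scheme can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the category names, the parent links, and the rule that a label implies all of its ancestors are hypothetical and are not taken from the dataset's actual 81-category taxonomy. The sketch only shows how a category with more than one parent can represent an intersection of hate speech types, and how a fine-grained label can be reduced to the binary 'hate' vs. 'no-hate' view.

```python
# Hypothetical category graph: each label maps to its parent labels.
# These names are illustrative placeholders, NOT the dataset's real taxonomy.
PARENTS = {
    "hate": [],
    "sexism": ["hate"],
    "racism": ["hate"],
    "women": ["sexism"],
    "black_people": ["racism"],
    # A node with two parents models an intersection of two categories.
    "black_women": ["women", "black_people"],
}


def expand(labels):
    """Return the given labels together with all of their ancestors."""
    seen = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label not in seen:
            seen.add(label)
            stack.extend(PARENTS[label])
    return seen


def binary_label(labels):
    """Collapse fine-grained labels to the binary 'hate' / 'no-hate' view."""
    return "hate" if "hate" in expand(labels) else "no-hate"


if __name__ == "__main__":
    print(sorted(expand(["black_women"])))
    print(binary_label(["women"]))
    print(binary_label([]))
```

Under these assumptions, a tweet tagged only with the intersection category is also counted under both of its parent branches, which is one way a hierarchy can make overlapping categories easy to query.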

Pages: 94-104

Page count: 11

50 items in total

[1] A Turkish Hate Speech Dataset and Detection System
Beyhan, Fatih
Carik, Buse
Arin, Inanc
Terzioglu, Aysecan
Yanikoglu, Berrin
Yeniterzi, Reyyan
LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 4177 - 4185
[2] Towards an Organically Growing Hate Speech Dataset in Hate Speech Detection Systems in a Smart Mobility Application
Alsamman, Ahmad
Schmitz, Andreas
Wimmer, Maria A.
TOGETHER IN THE UNSTABLE WORLD: DIGITAL GOVERNMENT AND SOLIDARITY, 2023, : 36 - 43
[3] HateXplain: A Benchmark Dataset for Explainable Hate Speech Detection
Mathew, Binny
Saha, Punyajoy
Yimam, Seid Muhie
Biemann, Chris
Goyal, Pawan
Mukherjee, Animesh
THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 14867 - 14875
[4] A Benchmark Dataset for Learning to Intervene in Online Hate Speech
Qian, Jing
Bethke, Anna
Liu, Yinyin
Belding, Elizabeth
Wang, William Yang
2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 4755 - 4764
[5] Understanding hate speech: the HateInsights dataset and model interpretability
Arshad, Muhammad Umair
Shahzad, Waseem
PeerJ Computer Science, 2024, 10
[6] Using Cross Lingual Learning for Detecting Hate Speech in Portuguese
Firmino, Anderson Almeida
de Baptista, Claudio Souza
de Paiva, Anselmo Cardoso
DATABASE AND EXPERT SYSTEMS APPLICATIONS, DEXA 2021, PT II, 2021, 12924 : 170 - 175
[7] A curated dataset for hate speech detection on social media text
Mody, Devansh
Huang, YiDong
de Oliveira, Thiago Eustaquio Alves
DATA IN BRIEF, 2023, 46
[8] ETHOS: a multi-label hate speech detection dataset
Mollas, Ioannis
Chrysopoulou, Zoe
Karlos, Stamatis
Tsoumakas, Grigorios
COMPLEX & INTELLIGENT SYSTEMS, 2022, 8 (06) : 4663 - 4678
[9] ETHOS: a multi-label hate speech detection dataset
Ioannis Mollas
Zoe Chrysopoulou
Stamatis Karlos
Grigorios Tsoumakas
Complex & Intelligent Systems, 2022, 8 : 4663 - 4678
[10] IndicCONAN: A Multilingual Dataset for Combating Hate Speech in Indian Context
Sahoo, Nihar Ranjan
Beria, Gyana Prakash
Bhattacharyya, Pushpak
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 20, 2024, : 22313 - 22321
