DART: A Large Dataset of Dialectal Arabic Tweets

Authors
Alsarsour, Israa [1]
Mohamed, Esraa [1]
Suwaileh, Reem [1]
Elsayed, Tamer [1]
Affiliations
[1] Qatar Univ, Comp Sci & Engn Dept, Doha, Qatar
Keywords
Arabic; Multi-Dialect; Twitter; Crowdsourcing; Annotations; Corpus;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
In this paper, we present a new, large, manually annotated multi-dialect dataset of Arabic tweets that is publicly available. The Dialectal ARabic Tweets (DART) dataset contains about 25K tweets annotated via crowdsourcing and is well balanced over five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi. The paper outlines the pipeline for constructing the dataset, from crawling tweets that match a list of dialect phrases to having the crowd annotate them. We also touch on some challenges that we faced during the process. We evaluate the quality of the dataset from two perspectives: the inter-annotator agreement and the accuracy of the final labels. Results show that both measures were high for the Egyptian, Gulf, and Levantine dialect groups but lower for the Iraqi and Maghrebi dialects, which indicates the difficulty of identifying those two dialects manually and hence automatically.
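The two quality checks named in the abstract can be illustrated with a small sketch. The abstract does not say which agreement statistic or label-aggregation rule the authors used, so Fleiss' kappa and a simple majority vote are assumptions here, and the function names `majority_label` and `fleiss_kappa` are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label, or None if first place is tied.

    Assumption: final labels come from a plain majority vote; the
    abstract does not state the actual aggregation rule.
    """
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def fleiss_kappa(items):
    """Fleiss' kappa for items that all received the same number of votes.

    Assumption: Fleiss' kappa stands in for the unspecified
    inter-annotator agreement measure.
    """
    n = len(items[0])  # votes per item
    if n < 2 or any(len(votes) != n for votes in items):
        raise ValueError("every item needs the same number (>= 2) of votes")

    totals = Counter()  # votes per category over all items
    p_bar = 0.0         # mean observed agreement per item
    for votes in items:
        counts = Counter(votes)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar /= len(items)

    # Agreement expected by chance from the overall category proportions.
    p_e = sum((c / (len(items) * n)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return float("nan")  # undefined when only one category was ever used
    return (p_bar - p_e) / (1 - p_e)
```

Computing `fleiss_kappa` separately on the tweets of each dialect group would give a per-group agreement figure of the kind the abstract compares, while `majority_label` marks tied tweets with `None` so they can be inspected or dropped.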
Pages: 3666-3670
Number of pages: 5