DART: A Large Dataset of Dialectal Arabic Tweets

Authors
Alsarsour, Israa [1]
Mohamed, Esraa [1]
Suwaileh, Reem [1]
Elsayed, Tamer [1]
Affiliations
[1] Qatar University, Computer Science and Engineering Department, Doha, Qatar
Keywords
Arabic; Multi-Dialect; Twitter; Crowdsourcing; Annotations; Corpus
DOI
Not available
Chinese Library Classification
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
In this paper, we present a new, large, manually annotated multi-dialect dataset of Arabic tweets that is publicly available. The Dialectal ARabic Tweets (DART) dataset has about 25K tweets that are annotated via crowdsourcing, and it is well balanced over five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf, and Iraqi. The paper outlines the pipeline for constructing the dataset, from crawling tweets that match a list of dialect phrases to having the crowd annotate the tweets. We also touch on some of the challenges we faced during the process. We evaluate the quality of the dataset from two perspectives: the inter-annotator agreement and the accuracy of the final labels. Results show that both measures were substantially high for the Egyptian, Gulf, and Levantine dialect groups but lower for the Iraqi and Maghrebi dialects, which indicates that those two dialects are difficult to identify manually and hence automatically.
Pages: 3666-3670
Number of pages: 5