PersianQuAD: The Native Question Answering Dataset for the Persian Language

Authors
Kazemi, Arefeh [1]
Mozafari, Jamshid [2]
Nematbakhsh, Mohammad Ali [2]
Affiliations
[1] Univ Isfahan, Fac Foreign Languages, Dept Linguist, Esfahan 8174673441, Iran
[2] Univ Isfahan, Fac Comp Engn, Big Data Res Grp, Esfahan 8174673441, Iran
Source
IEEE ACCESS | 2022 / Vol. 10
Keywords
Internet; Online services; Encyclopedias; Training; Task analysis; Machine translation; Buildings; Dataset; deep learning; natural language processing; Persian; question answering; machine reading comprehension
DOI
10.1109/ACCESS.2022.3157289
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Developing Question Answering (QA) systems is one of the main goals of Artificial Intelligence. With the advent of Deep Learning (DL) techniques, QA systems have advanced significantly. Although DL performs very well on QA, it requires a considerable amount of annotated training data. Many annotated datasets have been built for the QA task, but most of them are exclusively in English. To address the need for a high-quality QA dataset in the Persian language, we present PersianQuAD, the native QA dataset for the Persian language. We create PersianQuAD in four steps: 1) Wikipedia article selection, 2) question-answer collection, 3) three-candidate test set preparation, and 4) data quality monitoring. PersianQuAD consists of approximately 20,000 questions and answers made by native annotators on a set of Persian Wikipedia articles. The answer to each question is a segment of the corresponding article text. To better understand PersianQuAD and ensure its representativeness, we analyze the dataset and show that it contains questions of varying types and difficulties. We also present three versions of a deep learning-based QA system trained on PersianQuAD. Our best system achieves an F1 score of 82.97%, which is comparable to that of QA systems on the English SQuAD dataset, created at Stanford University. This shows that PersianQuAD is well suited to training deep learning-based QA systems. Human performance on PersianQuAD is significantly better (96.49%), demonstrating that PersianQuAD is challenging enough and that there is still plenty of room for future improvement. PersianQuAD and all QA models implemented in this paper are freely available.
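The abstract reports an F1 score for answers that are spans of the article text. As an illustration only, the following is a minimal sketch of the token-overlap F1 commonly used for SQuAD-style datasets. It assumes plain whitespace tokenization with no Persian-specific normalization, and the function names `token_f1` and `best_f1` are hypothetical; the paper's actual evaluation script may differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer span.

    Assumption: tokens are obtained by whitespace splitting only.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: list[str]) -> float:
    """Score a prediction against several candidate answers, keeping the best.

    Assumption: a test question with multiple candidate answers is scored
    by the maximum F1 over those candidates, as in SQuAD-style evaluation.
    """
    return max(token_f1(prediction, ref) for ref in references)
```

A dataset-level score under these assumptions would be the mean of `best_f1` over all test questions.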
Pages: 26045-26057
Number of pages: 13