PQuAD: A Persian question answering dataset

Cited by: 4
Authors
Darvishi, Kasra [1]
Shahbodaghkhan, Newsha [1]
Abbasiantaeb, Zahra [1]
Momtazi, Saeedeh [1]
Affiliations
[1] Amirkabir Univ Technol, Comp Engn Dept, Tehran Polytech, Tehran, Iran
Source
Keywords
Machine reading comprehension; Natural language processing; Persian dataset; Question answering;
DOI
10.1016/j.csl.2023.101486
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We present the Persian Question Answering Dataset (PQuAD), a crowdsourced reading comprehension dataset built on Persian Wikipedia articles. It includes 80,000 questions along with their answers, with 25% of the questions being adversarially unanswerable. We examine various properties of the dataset to show its diversity and its level of difficulty as a machine reading comprehension (MRC) benchmark. By releasing this dataset, we aim to ease research on Persian reading comprehension and the development of Persian question answering (QA) systems. Our experiments with different state-of-the-art pre-trained contextualized language models yield 74.8% Exact Match (EM) and 87.6% F1-score, which can serve as baseline results for further research on Persian QA.
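For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive QA. It is an illustration under stated assumptions, not the authors' evaluation script: it tokenizes on whitespace only, applies no Persian-specific normalization, and represents an unanswerable question by an empty gold answer. The function names `exact_match` and `f1_score` are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the whitespace-tokenized strings are identical, else 0.0."""
    return float(prediction.split() == gold.split())


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    # Assumption: an empty gold answer marks an unanswerable question,
    # so the score is 1.0 only if the prediction is empty as well.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level scores such as those reported above are typically the mean of these per-question values, usually taking the maximum over all gold answers when a question has several.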
Pages: 7