Slovak Dataset for Multilingual Question Answering

Times cited: 0
Authors
Hladek, Daniel [1]
Stas, Jan [1]
Juhar, Jozef [1]
Koctur, Tomas [2]
Affiliations
[1] Tech Univ Kosice, Fac Elect Engn & Informat, Kosice 04200, Slovakia
[2] Deutsch Telekom IT & Telecommun Slovakia, Fac Elect Engn & Informat, Kosice 04001, Slovakia
Keywords
Question answering (information retrieval); Machine translation; Annotations; Natural language processing; Text analysis; Learning systems; Crosslingual dataset; monolingual dataset; multilingual dataset; neural language model; question answering; Slovak language; BENCHMARK; CZECH
DOI
10.1109/ACCESS.2023.3262308
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
SK-QuAD is the first manually annotated dataset of questions and answers in Slovak. It consists of more than 91k factual questions and answers from various fields. Each question has an answer marked in the corresponding paragraph. The dataset also contains negative examples in the form of "unanswered questions" and "plausible answers". It is published free of charge for scientific use. We aim to contribute to the creation of Slovak or multilingual systems that generate an answer to a question posed in natural language. The paper provides an overview of the existing datasets for question answering, describes the annotation process, and statistically analyzes the created content. The dataset expands the possibilities for training and evaluating multilingual language models. Experiments show that models trained on the dataset achieve state-of-the-art results for Slovak and that the dataset improves question answering for other languages in zero-shot learning. We compare the effect of machine-translated data with that of manually annotated data. Additional data improve modeling for low-resource languages.
Pages: 32869-32881
Number of pages: 13