Automatic Spanish Translation of the SQuAD Dataset for Multilingual Question Answering

Cited by: 0
Authors
Carrino, Casimiro Pio [1]
Costa-jussa, Marta R. [1]
Fonollosa, Jose A. R. [1]
Affiliation
[1] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain
Keywords
Question Answering; Multilinguality; Corpus Creation
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Multilingual question answering has recently become a crucial research topic and is receiving increased interest in the NLP community. However, the lack of large-scale datasets makes it challenging to train multilingual QA systems whose performance is comparable to that of English ones. In this work, we develop the Translate Align Retrieve (TAR) method to automatically translate the Stanford Question Answering Dataset (SQuAD) v1.1 into Spanish. We then use this dataset to train Spanish QA systems by fine-tuning a Multilingual-BERT model. Finally, we evaluate our QA models on the recently proposed MLQA and XQuAD benchmarks for cross-lingual extractive QA. Experimental results show that our models outperform the previous Multilingual-BERT baselines, achieving new state-of-the-art values of 68.1 F1 on the Spanish MLQA corpus and 77.6 F1 on the Spanish XQuAD corpus. The resulting synthetically generated SQuAD-es v1.1 corpus, which contains almost 100% of the data in the original English version, is, to the best of our knowledge, the first large-scale QA training resource for Spanish.
Pages: 5515-5523
Number of pages: 9