Training Question Answering Models From Synthetic Data

Authors
Puri, Raul [1]
Spring, Ryan [2]
Shoeybi, Mohammad [1]
Patwary, Mostofa [1]
Catanzaro, Bryan [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Rice Univ, Houston, TX USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human-labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models, and it explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic text corpus generated by an 8.3-billion-parameter GPT-2 model and achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain in EM score compared to prior work using synthetic data.
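The abstract reports results as Exact Match (EM) and F1. This record does not say how those scores are computed, so the following is only a minimal sketch of the commonly used SQuAD-style definitions: answers are normalized (lowercased, punctuation and English articles removed), EM checks string equality, and F1 measures token overlap. The function names are hypothetical and chosen for illustration; this is not the authors' evaluation code.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize_answer(prediction) == normalize_answer(reference)


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def best_over_references(metric, prediction: str, references: list) -> float:
    """Score a prediction against several references and keep the best score."""
    return max(float(metric(prediction, ref)) for ref in references)
```

A dataset-level score under this sketch would be the average of the per-question best scores, multiplied by 100 to match the scale used in the abstract.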
Pages: 5811-5826
Page count: 16