Training Question Answering Models From Synthetic Data

Authors
Puri, Raul [1]
Spring, Ryan [2]
Shoeybi, Mohammad [1]
Patwary, Mostofa [1]
Catanzaro, Bryan [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Rice Univ, Houston, TX USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human-labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models, and it explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question answering task, we achieve higher accuracy using solely synthetic questions and answers than when using the SQuAD1.1 training set questions alone. Removing access to real Wikipedia data, we synthesize questions and answers from a synthetic text corpus generated by an 8.3 billion parameter GPT-2 model and achieve 88.4 Exact Match (EM) and 93.9 F1 score on the SQuAD1.1 dev set. We further apply our methodology to SQuAD2.0 and show a 2.8 absolute gain on EM score compared to prior work using synthetic data.
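For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for SQuAD-style answers. It is not taken from the paper: the function names (`normalize`, `exact_match`, `f1_score`) are hypothetical, and the normalization steps (lowercasing, dropping punctuation and English articles) are an assumption modeled on common SQuAD evaluation practice.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this mirrors common SQuAD-style answer normalization.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A dataset-level score is typically the average of these per-question values, expressed as a percentage; when a question has several reference answers, the maximum over references is usually taken.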
Pages: 5811-5826
Page count: 16