DISFL-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering

Cited by: 0
Authors
Gupta, Aditya [1]
Xu, Jiacheng [2, 4]
Upadhyay, Shyam [1]
Yang, Diyi [3]
Faruqui, Manaal [1]
Affiliations
[1] Google Assistant, Mountain View, CA USA
[2] Univ Texas Austin, Austin, TX 78712 USA
[3] Georgia Inst Technol, Atlanta, GA 30332 USA
[4] Google, Mountain View, CA USA
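Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract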
Disfluency is an under-studied topic in NLP, even though it is ubiquitous in human conversation. This is largely due to the lack of datasets containing disfluencies. In this paper, we present a new challenge question answering dataset, DISFL-QA, a derivative of SQUAD, where humans introduce contextual disfluencies in previously fluent questions. DISFL-QA contains a variety of challenging disfluencies that require a more comprehensive understanding of the text than what was necessary in prior datasets. Experiments show that the performance of existing state-of-the-art question answering models degrades significantly when tested on DISFL-QA in a zero-shot setting. We show that data augmentation methods partially recover the loss in performance and also demonstrate the efficacy of using gold data for fine-tuning. We argue that large-scale disfluency datasets are needed in order for NLP models to be robust to disfluencies. The dataset is publicly available at: https://github.com/google-research-datasets/disfl-qa.
Pages: 3309-3319
Number of pages: 11