Evaluating Open-Domain Question Answering in the Era of Large Language Models

Cited by: 0
Authors
Kamalloo, Ehsan [1 ,2 ]
Dziri, Nouha [3 ]
Clarke, Charles L. A. [2 ]
Rafiei, Davood [1 ]
Institutions
[1] Univ Alberta, Edmonton, AB, Canada
[2] Univ Waterloo, Waterloo, ON, Canada
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Source
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1 | 2023
Keywords: (none listed)
DOI: (none available)
CLC number: TP18 [Theory of Artificial Intelligence]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-OPEN, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state-of-the-art on NQ-OPEN. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, though it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
Pages: 5591-5606
Page count: 16
Related papers (50 total; items [41]-[50] shown)
  • [41] Li, Jie; Zhao, Fuyong; Chen, Panfeng; Xie, Jiafu; Zhang, Xiangrui; Li, Hui; Chen, Mei; Wang, Yanhao; Zhu, Ming. An astronomical question answering dataset for evaluating large language models. SCIENTIFIC DATA, 2025, 12 (01)
  • [42] Han, Rujun; Qi, Peng; Zhang, Yuhao; Liu, Lan; Burger, Juliette; Wang, William Yang; Huang, Zhiheng; Xiang, Bing; Roth, Dan. RobustQA: Benchmarking the Robustness of Domain Adaptation for Open-Domain Question Answering. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023: 4294-4311
  • [43] Dehghani, Mostafa; Azarbonyad, Hosein; Kamps, Jaap; de Rijke, Maarten. Learning to Transform, Combine, and Reason in Open-Domain Question Answering. PROCEEDINGS OF THE TWELFTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING (WSDM'19), 2019: 681-689
  • [44] Oduro-Afriyie, Joel; Jamil, Hasan. Knowledge Graph Enabled Open-Domain Conversational Question Answering. FLEXIBLE QUERY ANSWERING SYSTEMS, FQAS 2023, 2023, 14113: 63-76
  • [45] Adlakha, Vaibhav; Dhuliawala, Shehzaad; Suleman, Kaheer; de Vries, Harm; Reddy, Siva. TopiOCQA: Open-domain Conversational Question Answering with Topic Switching. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2022, 10: 468-483
  • [46] Li, Yongqi; Li, Wenjie; Nie, Liqiang. Dynamic Graph Reasoning for Conversational Open-Domain Question Answering. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2022, 40 (04)
  • [47] Jafarzadeh, Parastoo; Ensan, Faezeh. An evidence-based approach for open-domain question answering. KNOWLEDGE AND INFORMATION SYSTEMS, 2025, 67 (02): 1969-1991
  • [48] Harabagiu, Sanda; Hickl, Andrew. Methods for Using Textual Entailment in Open-Domain Question Answering. COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006: 905-912
  • [49] Prasannan, Vysakh; Shemshian, Shahin; Gurkan, Arinc; Saheer, Lakshmi Babu; Oghaz, Mahdi Maktabdar. Two-Phase Open-Domain Question Answering System. ARTIFICIAL INTELLIGENCE XXXIX, AI 2022, 2022, 13652: 353-358
  • [50] Yang, Zhengzhe; Choi, Jinho D. FriendsQA: Open-Domain Question Answering on TV Show Transcripts. 20TH ANNUAL MEETING OF THE SPECIAL INTEREST GROUP ON DISCOURSE AND DIALOGUE (SIGDIAL 2019), 2019: 188-197