Evaluating Open-Domain Question Answering in the Era of Large Language Models

Cited by: 0
Authors:
Kamalloo, Ehsan [1 ,2 ]
Dziri, Nouha [3 ]
Clarke, Charles L. A. [2 ]
Rafiei, Davood [1 ]
Affiliations:
[1] Univ Alberta, Edmonton, AB, Canada
[2] Univ Waterloo, Waterloo, ON, Canada
[3] Allen Inst Artificial Intelligence, Seattle, WA USA
Source:
PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1 | 2023
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Discipline codes: 081104; 0812; 0835; 1405
Abstract:
Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is increasingly the case as we shift from extractive to generative models. The recent success of large language models (LLMs) for QA aggravates lexical matching failures since candidate answers become longer, thereby making matching with the gold answers even more challenging. Without accurate evaluation, the true progress in open-domain QA remains unknown. In this paper, we conduct a thorough analysis of various open-domain QA models, including LLMs, by manually evaluating their answers on a subset of NQ-OPEN, a popular benchmark. Our assessments reveal that while the true performance of all models is significantly underestimated, the performance of the InstructGPT (zero-shot) LLM increases by nearly +60%, making it on par with existing top models, and the InstructGPT (few-shot) model actually achieves a new state of the art on NQ-OPEN. We also find that more than 50% of lexical matching failures are attributed to semantically equivalent answers. We further demonstrate that regex matching ranks QA models consistently with human judgments, although it still suffers from unnecessary strictness. Finally, we demonstrate that automated evaluation models are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers generated by LLMs. The automated models struggle to detect hallucinations in LLM answers and are thus unable to evaluate LLMs. At this time, there appears to be no substitute for human evaluation.
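To make the failure mode concrete, below is a minimal Python sketch of the two string-based protocols the abstract contrasts: exact match (lexical matching) under SQuAD-style answer normalization, and the laxer regex variant. The normalization rules, helper names, and example question are illustrative assumptions, not the paper's actual evaluation code.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and
    English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(candidate: str, gold_answers: list[str]) -> bool:
    """Lexical matching: correct only if the normalized candidate
    equals some normalized gold answer."""
    return any(normalize(candidate) == normalize(g) for g in gold_answers)

def regex_match(candidate: str, gold_patterns: list[str]) -> bool:
    """Regex matching: correct if any gold pattern occurs anywhere in
    the normalized candidate, so longer answers can still pass."""
    return any(re.search(p, normalize(candidate)) for p in gold_patterns)

# A correct but verbose LLM-style answer fails lexical matching
# outright, while the laxer regex variant accepts it.
gold = ["John F. Kennedy"]
answer = "The 35th president of the United States was John F. Kennedy."
print(exact_match(answer, gold))                 # False
print(regex_match(answer, [r"john f kennedy"]))  # True
```

Because exact match requires full-string equality after normalization, any correct answer phrased as a sentence is scored as wrong, which is precisely the failure mode the abstract argues grows as models become generative.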
Pages: 5591-5606 (16 pages)