Meta-evaluation of Conversational Search Evaluation Metrics

Cited by: 6
Authors
Liu, Zeyang [1 ]
Zhou, Ke [1 ,2 ]
Wilson, Max L. [1 ]
Affiliations
[1] Univ Nottingham, Sch Comp Sci, Jubilee Campus Wollaton Rd, Nottingham NG8 1BB, England
[2] Nokia Bell Labs, Broers Bldg, Cambridge CB3 0FA, England
Keywords
Conversational search; meta-evaluation; metric; discriminative power
DOI
10.1145/3445029
Chinese Library Classification (CLC): TP [Automation technology; computer technology]
Discipline Code: 0812
Abstract
Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and that users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect "actual" performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across different scenarios, whereas, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric when all three perspectives are considered. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
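To make the three meta-evaluation perspectives concrete, the following minimal Python sketch (illustrative only, not code from the article; the toy data, function names, and the choice of a paired bootstrap test are assumptions) shows how fidelity could be estimated as Kendall's tau between a metric's per-response scores and user preference judgments, and how reliability (discriminative power) could be probed by testing whether the score difference between two systems is larger than expected by chance.

```python
# Illustrative sketch of two meta-evaluation perspectives (toy data, assumed
# procedure; not the article's implementation).
import numpy as np
from scipy.stats import kendalltau


def fidelity(metric_scores, user_preferences):
    """Fidelity: Kendall's tau between metric scores and user preference labels."""
    tau, _ = kendalltau(metric_scores, user_preferences)
    return tau


def is_difference_significant(scores_a, scores_b, n_boot=1000, alpha=0.05, seed=0):
    """Reliability probe: paired bootstrap test on per-topic score differences.

    Resamples topics with replacement and counts how often the mean difference
    changes sign relative to the observed mean difference.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed_sign = np.sign(diffs.mean())
    flips = sum(
        np.sign(rng.choice(diffs, size=len(diffs), replace=True).mean()) != observed_sign
        for _ in range(n_boot)
    )
    p_value = flips / n_boot
    return p_value < alpha, p_value


if __name__ == "__main__":
    # Hypothetical per-response metric scores and graded user preferences (1-3).
    metric_scores = [0.31, 0.55, 0.12, 0.78, 0.44]
    user_prefs = [2, 3, 1, 3, 2]
    print("fidelity (Kendall's tau):", fidelity(metric_scores, user_prefs))

    # Hypothetical per-topic scores of two conversational search systems.
    system_a = [0.42, 0.51, 0.38, 0.60, 0.47]
    system_b = [0.35, 0.49, 0.33, 0.58, 0.41]
    print("A vs. B significant?", is_difference_significant(system_a, system_b))
```

In this style of meta-evaluation, discriminative power is typically reported as the proportion of system pairs whose difference is significant at a chosen significance level, and fidelity as the agreement between metric-induced preferences and user-judged preferences.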
Pages: 42
Related Articles (50 in total)
  • [1] Meta-evaluation of Online and Offline Web Search Evaluation Metrics
    Chen, Ye
    Zhou, Ke
    Liu, Yiqun
    Zhang, Min
    Ma, Shaoping
    [J]. SIGIR'17: PROCEEDINGS OF THE 40TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 2017, : 15 - 24
  • [2] A Meta-Evaluation of Evaluation Methods for Diversified Search
    Kingrani, Suneel Kumar
    Levene, Mark
    Zhang, Dell
    [J]. ADVANCES IN INFORMATION RETRIEVAL (ECIR 2018), 2018, 10772 : 550 - 555
  • [3] Automatic Meta-evaluation of Low-Resource Machine Translation Evaluation Metrics
    Yu, Junting
    Liu, Wuying
    He, Hongye
    Wang, Lin
    [J]. PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 136 - 141
  • [4] State-Aware Meta-Evaluation of Evaluation Metrics in Interactive Information Retrieval
    Liu, Jiqun
    Yu, Ran
    [J]. PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 3258 - 3262
  • [5] Meta-evaluation: Evaluation of evaluations
    Georghiou, L.
    [J]. SCIENTOMETRICS, 1999, 45 : 523 - 530
  • [6] Meta-evaluation: Evaluation of evaluations
    Praestgaard, E.
    [J]. SCIENTOMETRICS, 1999, 45 (03) : 531 - 532
  • [7] Development of a Meta-Evaluation Rubric and Meta-Evaluation of Initial Teacher Education Programs
    Burakgazi, Sevinc Gelmez
    Karsantik, Yasemin
    [J]. EGITIM VE BILIM-EDUCATION AND SCIENCE, 2024, 49 (217): : 225 - 248
  • [8] BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
    Ma, Liang
    Cao, Shuyang
    Logan, Robert L.
    Lu, Di
    Ran, Shihao
    Zhang, Ke
    Tetreault, Joel
    Jaimes, Alejandro
    [J]. PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 12788 - 12812
  • [9] Improving the quality of evaluation participation: a meta-evaluation
    Russ-Eft, Darlene
    Preskill, Hallie
    [J]. HUMAN RESOURCE DEVELOPMENT INTERNATIONAL, 2008, 11 (01) : 35 - 50