Evaluating the evaluation: A case study using the TREC 2002 question answering track

Cited by: 0
Author
Voorhees, EM [1]
Affiliation
[1] Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.
Pages: 260-267
Page count: 8
Related papers
50 in total
  • [41] Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature
    Badenes-Olmedo, Carlos
    Corcho, Oscar
    JOURNAL OF BIOMEDICAL INFORMATICS, 2023, 142
  • [42] Comparison of Methods for Evaluating Pavement Interventions Evaluation and Case Study
    Khurshid, Muhammad Bilal
    Irfan, Muhammad
    Labi, Samuel
    TRANSPORTATION RESEARCH RECORD, 2009, (2108) : 25 - 36
  • [43] Evaluating usability evaluation methods: Criteria, method and a case study
    Koutsabasis, P.
    Spyrou, T.
    Darzentas, J.
    HUMAN-COMPUTER INTERACTION, PT 1, PROCEEDINGS: INTERACTION DESIGN AND USABILITY, 2007, 4550 : 569 - +
  • [44] Evaluating Growth Models: A Case Study Using PrognosisBC
    Marshall, Peter
    Parysow, Pablo
    Akindele, Shadrach
    THIRD FOREST VEGETATION SIMULATOR CONFERENCE, 2008, 54 : 167 - +
  • [45] Modeling Extractive Question Answering Using Encoder-Decoder Models with Constrained Decoding and Evaluation-Based Reinforcement Learning
    Li, Shaobo
    Sun, Chengjie
    Liu, Bingquan
    Liu, Yuanchao
    Ji, Zhenzhou
    MATHEMATICS, 2023, 11 (07)
  • [46] Evaluation of track geometry degradation in Swedish heavy haul railroad - A case study
    Khouy, Iman Arasteh K.
    Schunnesson, Håkan
    Nissen, Arne
    Juntti, Ulla J.
    International Journal of COMADEM, 2012, 15 (02) : 11 - 16
  • [47] Evaluation of track geometry maintenance for a heavy haul railroad in Sweden: A case study
    Khouy, Iman Arasteh
    Schunnesson, Hakan
    Juntti, Ulla
    Nissen, Arne
    Larsson-Kraik, Per-Olof
    PROCEEDINGS OF THE INSTITUTION OF MECHANICAL ENGINEERS PART F-JOURNAL OF RAIL AND RAPID TRANSIT, 2014, 228 (05) : 496 - 503
  • [48] Railway Track Monitoring Using Train Measurements: An Experimental Case Study
    Malekjafarian, Abdollah
    OBrien, Eugene
    Quirke, Paraic
    Bowe, Cathal
    APPLIED SCIENCES-BASEL, 2019, 9 (22)
  • [49] A Semi-Supervised Learning Approach to Enhance Health Care Community-Based Question Answering: A Case Study in Alcoholism
    Wongchaisuwat, Papis
    Klabjan, Diego
    Jonnalagadda, Siddhartha Reddy
    JMIR MEDICAL INFORMATICS, 2016, 4 (03) : 18 - 30
  • [50] Evaluating Chatbot Efficacy for Answering Frequently Asked Questions in Plastic Surgery: A ChatGPT Case Study Focused on Breast Augmentation
    Seth, Ishith
    Cox, Aram
    Xie, Yi
    Bulloch, Gabriella
    Hunter-Smith, David J.
    Rozen, Warren M.
    Ross, Richard J.
    AESTHETIC SURGERY JOURNAL, 2023, 43 (10) : 1126 - 1135