Evaluating the evaluation: A case study using the TREC 2002 question answering track

Cited by: 0

Author:

Voorhees, EM [1]

Affiliation:

[1] Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA

Source:

HLT-NAACL 2003: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE | 2003

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. Thus it is important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.

Pages: 260-267

Page count: 8

