Gauging the Quality of Relevance Assessments using Inter-Rater Agreement

Cited: 5
Authors
Damessie, Tadele T. [1 ]
Nghiem, Thao P. [1 ]
Scholer, Falk [1 ]
Culpepper, J. Shane [1 ]
Affiliations
[1] RMIT Univ, Melbourne, Vic, Australia
Funding
Australian Research Council;
DOI
10.1145/3077136.3080729
Chinese Library Classification
TP [Automation and computer technology];
Discipline code
0812;
Abstract
In recent years, gathering relevance judgments through non-topic originators has become an increasingly important problem in Information Retrieval. Relevance judgments can be used to measure the effectiveness of a system, and are often needed to build supervised learning models in learning-to-rank retrieval systems. The two most popular approaches to gathering bronze level judgments - where the judge is not the originator of the information need for which relevance is being assessed, and is not a topic expert - are a controlled user study and crowdsourcing. However, judging comes at a cost (in time, and usually money) and the quality of the judgments can vary widely. In this work, we directly compare the reliability of judgments using three different types of bronze assessor groups. Our first group is a controlled Lab group; the second and third are two different crowdsourcing groups: CF-Document, where assessors were free to judge any number of documents for a topic, and CF-Topic, where judges were required to judge all of the documents from a single topic, in a manner similar to the Lab group. Our study shows that Lab assessors exhibit a higher level of agreement with a set of ground truth judgments than CF-Topic and CF-Document assessors. Inter-rater agreement rates show analogous trends. These findings suggest that in the absence of ground truth data, agreement between assessors can be used to reliably gauge the quality of relevance judgments gathered from secondary assessors, and that controlled user studies are more likely to produce reliable judgments despite being more costly.
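The abstract reports inter-rater agreement rates without naming the statistic used. A common chance-corrected measure for two assessors is Cohen's kappa; a minimal sketch follows (the binary relevance labels for the two hypothetical bronze assessors are invented for illustration, not taken from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Binary relevance judgments from two hypothetical bronze assessors.
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 0]
print(cohens_kappa(a, b))  # → 0.5
```

Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which makes scores comparable across assessor groups with different label distributions.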
Pages: 1089-1092
Page count: 4
Related Papers
50 records
  • [1] Comparison between Inter-rater Reliability and Inter-rater Agreement in Performance Assessment
    Liao, Shih Chieh
    Hunt, Elizabeth A.
    Chen, Walter
    [J]. ANNALS ACADEMY OF MEDICINE SINGAPORE, 2010, 39 (08) : 613 - 618
  • [2] Inter-rater agreement: a methodological issue
    Shahsavari, Meisam
    Shahsavari, Soodeh
    [J]. JOURNAL OF NEUROSURGERY, 2019, 131 (02) : 651 - 651
  • [3] Bayesian analysis for inter-rater agreement
    Broemeling, LD
    [J]. COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2001, 30 (03) : 437 - 446
  • [4] National inter-rater agreement of standardised simulated-patient-based assessments
    Sam, Amir H.
    Reid, Michael D.
    Thakerar, Viral
    Gurnell, Mark
    Westacott, Rachel
    Reed, Malcolm W. R.
    Brown, Celia A.
    [J]. MEDICAL TEACHER, 2021, 43 (03) : 341 - 346
  • [5] Establishing Inter-rater Agreement for TIDEE's Teamwork and Professional Development Assessments
    Gerlick, Robert
    Davis, Denny C.
    Trevisan, Michael S.
    Brown, Shane A.
    [J]. 2011 ASEE ANNUAL CONFERENCE & EXPOSITION, 2011,
  • [6] Double Entropy Inter-Rater Agreement Indices
    Olenko, Andriy
    Tsyganok, Vitaliy
    [J]. APPLIED PSYCHOLOGICAL MEASUREMENT, 2016, 40 (01) : 37 - 55
  • [7] The inter-rater reliability of mental capacity assessments
    Raymont, Vanessa
    Buchanan, Alec
    David, Anthony S.
    Hayward, Peter
    Wessely, Simon
    Hotopf, Matthew
    [J]. INTERNATIONAL JOURNAL OF LAW AND PSYCHIATRY, 2007, 30 (02) : 112 - 117
  • [8] Inter-rater Agreement for Social Computing Studies
    Salminen, Joni O.
    Al-Merekhi, Hind A.
    Dey, Partha
    Jansen, Bernard J.
    [J]. 2018 FIFTH INTERNATIONAL CONFERENCE ON SOCIAL NETWORKS ANALYSIS, MANAGEMENT AND SECURITY (SNAMS), 2018, : 80 - 87
  • [9] Inter-rater Agreement for the Clinical Dysphagia Scale
    Chun, Se Woong
    Lee, Seung Ah
    Jung, Il-Young
    Beom, Jaewon
    Han, Tai Ryoon
    Oh, Byung-Mo
    [J]. ANNALS OF REHABILITATION MEDICINE-ARM, 2011, 35 (04) : 470 - 476
  • [10] INTER-RATER AND INTRA-RATER AGREEMENT OF THE REHABILITATION ACTIVITIES PROFILE
    JELLES, F
    VANBENNEKOM, CAM
    LANKHORST, GJ
    SIBBEL, CJP
    BOUTER, LM
    [J]. JOURNAL OF CLINICAL EPIDEMIOLOGY, 1995, 48 (03) : 407 - 416