Gauging the Quality of Relevance Assessments using Inter-Rater Agreement

Cited: 5
Authors
Damessie, Tadele T. [1 ]
Nghiem, Thao P. [1 ]
Scholer, Falk [1 ]
Culpepper, J. Shane [1 ]
Affiliations
[1] RMIT Univ, Melbourne, Vic, Australia
Funding
Australian Research Council;
DOI
10.1145/3077136.3080729
Chinese Library Classification
TP [Automation and computer technology];
Discipline code
0812;
Abstract
In recent years, gathering relevance judgments through non-topic originators has become an increasingly important problem in Information Retrieval. Relevance judgments can be used to measure the effectiveness of a system, and are often needed to build supervised learning models in learning-to-rank retrieval systems. The two most popular approaches to gathering bronze level judgments - where the judge is not the originator of the information need for which relevance is being assessed, and is not a topic expert - are a controlled user study and crowdsourcing. However, judging comes at a cost (in time, and usually money) and the quality of the judgments can vary widely. In this work, we directly compare the reliability of judgments using three different types of bronze assessor groups. Our first group is a controlled Lab group; the second and third are two different crowdsourcing groups: CF-Document, where assessors were free to judge any number of documents for a topic, and CF-Topic, where judges were required to judge all of the documents from a single topic, in a manner similar to the Lab group. Our study shows that Lab assessors exhibit a higher level of agreement with a set of ground truth judgments than CF-Topic and CF-Document assessors. Inter-rater agreement rates show analogous trends. These findings suggest that in the absence of ground truth data, agreement between assessors can be used to reliably gauge the quality of relevance judgments gathered from secondary assessors, and that controlled user studies are more likely to produce reliable judgments despite being more costly.
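The abstract reports inter-rater agreement rates without naming the statistic used. A common chance-corrected measure for two assessors is Cohen's kappa; a minimal sketch follows (the binary relevance labels for the two hypothetical bronze assessors are invented for illustration, not taken from the paper):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Binary relevance judgments from two hypothetical bronze assessors.
a = [1, 1, 0, 1, 0, 0, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 0]
print(cohens_kappa(a, b))  # → 0.5
```

Kappa of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which makes scores comparable across assessor groups with different label distributions.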
Pages: 1089-1092
Page count: 4
Related Papers
50 records
  • [1] Comparison between Inter-rater Reliability and Inter-rater Agreement in Performance Assessment
    Liao, Shih Chieh
    Hunt, Elizabeth A.
    Chen, Walter
    [J]. ANNALS ACADEMY OF MEDICINE SINGAPORE, 2010, 39 (08) : 613 - 618
  • [2] Inter-rater agreement: a methodological issue
    Shahsavari, Meisam
    Shahsavari, Soodeh
    [J]. JOURNAL OF NEUROSURGERY, 2019, 131 (02) : 651 - 651
  • [3] Bayesian analysis for inter-rater agreement
    Broemeling, LD
    [J]. COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2001, 30 (03) : 437 - 446
  • [4] National inter-rater agreement of standardised simulated-patient-based assessments
    Sam, Amir H.
    Reid, Michael D.
    Thakerar, Viral
    Gurnell, Mark
    Westacott, Rachel
    Reed, Malcolm W. R.
    Brown, Celia A.
    [J]. MEDICAL TEACHER, 2021, 43 (03) : 341 - 346
  • [5] Establishing Inter-rater Agreement for TIDEE's Teamwork and Professional Development Assessments
    Gerlick, Robert
    Davis, Denny C.
    Trevisan, Michael S.
    Brown, Shane A.
    [J]. 2011 ASEE ANNUAL CONFERENCE & EXPOSITION, 2011,
  • [6] Double Entropy Inter-Rater Agreement Indices
    Olenko, Andriy
    Tsyganok, Vitaliy
    [J]. APPLIED PSYCHOLOGICAL MEASUREMENT, 2016, 40 (01) : 37 - 55
  • [7] The inter-rater reliability of mental capacity assessments
    Raymont, Vanessa
    Buchanan, Alec
    David, Anthony S.
    Hayward, Peter
    Wessely, Simon
    Hotopf, Matthew
    [J]. INTERNATIONAL JOURNAL OF LAW AND PSYCHIATRY, 2007, 30 (02) : 112 - 117
  • [8] Inter-rater Agreement for Social Computing Studies
    Salminen, Joni O.
    Al-Merekhi, Hind A.
    Dey, Partha
    Jansen, Bernard J.
    [J]. 2018 FIFTH INTERNATIONAL CONFERENCE ON SOCIAL NETWORKS ANALYSIS, MANAGEMENT AND SECURITY (SNAMS), 2018, : 80 - 87
  • [9] Inter-rater Agreement for the Clinical Dysphagia Scale
    Chun, Se Woong
    Lee, Seung Ah
    Jung, Il-Young
    Beom, Jaewon
    Han, Tai Ryoon
    Oh, Byung-Mo
    [J]. ANNALS OF REHABILITATION MEDICINE-ARM, 2011, 35 (04) : 470 - 476
  • [10] INTER-RATER AND INTRA-RATER AGREEMENT OF THE REHABILITATION ACTIVITIES PROFILE
    JELLES, F
    VANBENNEKOM, CAM
    LANKHORST, GJ
    SIBBEL, CJP
    BOUTER, LM
    [J]. JOURNAL OF CLINICAL EPIDEMIOLOGY, 1995, 48 (03) : 407 - 416