The Impact of Judgment Variability on the Consistency of Offline Effectiveness Measures

Cited by: 1
Authors
Rashidi, Lida [1]
Zobel, Justin [1]
Moffat, Alistair [1]
Affiliations
[1] Univ Melbourne, Sch Comp & Informat Syst, Parkville, Vic 3010, Australia
Funding
Australian Research Council
Keywords
Evaluation; relevance assessment; significance testing; relevance judgments
DOI
10.1145/3596511
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Measurement of the effectiveness of search engines is often based on use of relevance judgments. It is well known that judgments can be inconsistent between judges, leading to discrepancies that potentially affect not only scores but also system relativities and confidence in the experimental outcomes. We take the perspective that the relevance judgments are an amalgam of perfect relevance assessments plus errors; making use of a model of systematic errors in binary relevance judgments that can be tuned to reflect the kind of judge that is being used, we explore the behavior of measures of effectiveness as error is introduced. Using a novel methodology in which we examine the distribution of "true" effectiveness measurements that could be underlying measurements based on sets of judgments that include error, we find that even moderate amounts of error can lead to conclusions such as orderings of systems that statistical tests report as significant but are nonetheless incorrect. Further, in these results the widely used recall-based measures AP and NDCG are notably more fragile in the presence of judgment error than is the utility-based measure RBP, but all the measures failed under even moderate error rates. We conclude that knowledge of likely error rates in judgments is critical to interpretation of experimental outcomes.
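As a rough illustration of the idea in the abstract, the following Python sketch perturbs a list of binary relevance judgments and recomputes two effectiveness scores. It is not the authors' error model or methodology: the abstract describes a tunable model of systematic errors and an analysis of the "true" scores that could underlie noisy judgments, whereas this sketch assumes independent random flips and goes only in the forward direction (true judgments to noisy ones). The function names, the false-positive and false-negative rate parameters, and the RBP persistence value `p = 0.8` are all assumptions made for illustration.

```python
import random


def rbp(rels, p=0.8):
    """Rank-biased precision of a binary relevance list given in rank order."""
    return (1 - p) * sum(r * p**i for i, r in enumerate(rels))


def average_precision(rels, num_relevant):
    """Average precision, normalised by the assumed number of relevant documents."""
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / rank
    return total / num_relevant


def flip_judgments(rels, fp_rate, fn_rate, rng):
    """Hypothetical error model: each judgment is flipped independently.

    A non-relevant document (0) becomes relevant with probability fp_rate;
    a relevant document (1) becomes non-relevant with probability fn_rate.
    """
    noisy = []
    for r in rels:
        if r:
            noisy.append(0 if rng.random() < fn_rate else 1)
        else:
            noisy.append(1 if rng.random() < fp_rate else 0)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    true_rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]  # made-up judgments for one ranking
    noisy_rels = flip_judgments(true_rels, fp_rate=0.1, fn_rate=0.2, rng=rng)
    print("RBP true/noisy:", rbp(true_rels), rbp(noisy_rels))
    print("AP  true/noisy:",
          average_precision(true_rels, sum(true_rels)),
          average_precision(noisy_rels, sum(noisy_rels)))
```

Repeating the perturbation many times for two rankings would give a crude sense of how often their relative order changes under a given error rate, which is the kind of question the abstract raises.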
Pages: 31
Related papers
50 in total
  • [11] Measures for Impact, Consistency, and the h- and g-Indices
    Prathap, Gangan
    JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2014, 65 (05) : 1076 - 1078
  • [12] Relationships among consistency/variability and other phonological measures over time.
    Dodd, B
    Holm, A
    Crosbie, S
    TOPICS IN LANGUAGE DISORDERS, 2006, 26 (02) : 172 - 174
  • [13] A new judgment method for the satisfying consistency of linguistic judgment matrix
    Dai, Jianhua
    Li, Jun
    Xue, Hengxin
    FIFTH WUHAN INTERNATIONAL CONFERENCE ON E-BUSINESS, VOLS 1-3: INTEGRATION AND INNOVATION THROUGH MEASUREMENT AND MANAGEMENT, 2006, : 90 - 96
  • [14] Grey Judgment Matrix and Its Consistency
    Feng, Lixiang
    Yuan, Chaoqing
    Liu, Sifeng
    2008 7TH WORLD CONGRESS ON INTELLIGENT CONTROL AND AUTOMATION, VOLS 1-23, 2008, : 1077 - 1080
  • [15] Consistency and Agreement in the Judgment of Rorschach Signs
    Forer, B. R.
    Farberow, N. L.
    Meyer, M. M.
    Tolman, R. S.
    JOURNAL OF PROJECTIVE TECHNIQUES, 1952, 16 (03) : 346 - 351
  • [16] On the Variability of Individual Judgment
    Reichardt
    ARCHIV FUR DIE GESAMTE PSYCHOLOGIE, 1910, 16 (3-4) : 89 - 89
  • [17] Balancing consistency and expert judgment in AHP
    Benitez, J.
    Delgado-Galvan, X.
    Gutierrez, J. A.
    Izquierdo, J.
    MATHEMATICAL AND COMPUTER MODELLING, 2011, 54 (7-8) : 1785 - 1790
  • [18] Variability in the affective judgment
    Hunt, WA
    Flannery, J
    AMERICAN JOURNAL OF PSYCHOLOGY, 1938, 51 : 507 - 513
  • [19] REASSESSMENT OF CONSISTENCY CRITERIA IN JUDGMENT MATRICES
    DODD, FJ
    DONEGAN, HA
    MCMASTER, TBM
    STATISTICIAN, 1995, 44 (01) : 31 - 41
  • [20] Judgment scales and consistency measure in AHP
    Franek, Jiri
    Kresta, Ales
    17TH INTERNATIONAL CONFERENCE ENTERPRISE AND COMPETITIVE ENVIRONMENT 2014, 2014, 12 : 164 - 173