Bayesian Inferential Risk Evaluation On Multiple IR Systems

Cited by: 1
Authors
Benham, Rodger [1]
Carterette, Ben [2]
Culpepper, J. Shane [1]
Moffat, Alistair [3]
Affiliations
[1] RMIT Univ, Melbourne, Vic, Australia
[2] Spotify, New York, NY USA
[3] Univ Melbourne, Melbourne, Vic, Australia
Funding
Australian Research Council
Keywords
Bayesian inference; risk-biased evaluation; multiple comparisons; effectiveness metric; credible intervals
DOI
10.1145/3397271.3401033
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Information retrieval (IR) ranking models in production systems continually evolve in response to user feedback, insights from research, and new developments. Rather than investing all engineering resources to produce a single challenger to the existing system, a commercial provider might choose to explore multiple new ranking models simultaneously. However, even small changes to a complex model can have unintended consequences. In particular, the per-topic effectiveness profile is likely to change, and even when an overall improvement is achieved, gains are rarely observed for every query, introducing the risk that some users or queries may be negatively impacted by the new model if deployed into production. Risk adjustments that re-weight losses relative to gains and mitigate such behavior are available when making one-to-one system comparisons, but not for one-to-many or many-to-one comparisons. Moreover, no IR evaluation methodology integrates priors from previous or alternative rankers in a homogeneous inferential framework. In this work, we propose a Bayesian approach where multiple challengers are compared to a single champion. We also show that risk can be incorporated, and demonstrate the benefits of doing so. Finally, the alternative scenario that is commonly encountered in academic research is also considered, when a single challenger is compared against several previous champions.
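The abstract describes two ingredients: risk adjustments that re-weight per-topic losses relative to gains, and a one-to-many comparison in which several challengers are measured against a single champion. What follows is a minimal sketch under stated assumptions, not the paper's method. The loss weight (1 + alpha) follows the common URisk-style formulation. Rubin's Bayesian bootstrap, which draws Dirichlet(1, ..., 1) weights over topics, is swapped in as a deliberately simple Bayesian stand-in for the full inferential model the authors propose. The function names, the default alpha, and the example scores are all hypothetical.

```python
import random
import statistics


def risk_adjusted_deltas(challenger, champion, alpha=1.0):
    """Per-topic score differences, with losses scaled by (1 + alpha).

    Hypothetical helper: alpha = 0 yields plain differences, and a larger
    alpha penalizes topics where the challenger does worse than the champion.
    """
    if len(challenger) != len(champion):
        raise ValueError("score lists must cover the same topics")
    deltas = []
    for c, b in zip(challenger, champion):
        d = c - b
        deltas.append(d if d >= 0 else (1.0 + alpha) * d)
    return deltas


def bayesian_bootstrap_interval(values, draws=4000, level=0.95, seed=0):
    """Equal-tailed credible interval for the mean via the Bayesian bootstrap.

    Each draw weights the observations with Dirichlet(1, ..., 1) weights,
    generated here as normalized Exp(1) variates.
    """
    if not values:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)
    means = []
    for _ in range(draws):
        w = [rng.expovariate(1.0) for _ in values]
        total = sum(w)
        means.append(sum(wi * v for wi, v in zip(w, values)) / total)
    means.sort()
    lo = means[int((1 - level) / 2 * draws)]
    hi = means[int((1 + level) / 2 * draws) - 1]
    return lo, hi


def compare_challengers(champion, challengers, alpha=1.0):
    """One-to-many comparison: summarize each challenger against the champion.

    Returns {name: (risk-adjusted mean delta, interval low, interval high)}.
    """
    report = {}
    for name, scores in challengers.items():
        deltas = risk_adjusted_deltas(scores, champion, alpha)
        lo, hi = bayesian_bootstrap_interval(deltas)
        report[name] = (statistics.mean(deltas), lo, hi)
    return report


if __name__ == "__main__":
    # Hypothetical per-topic effectiveness scores for five topics.
    champion = [0.30, 0.45, 0.20, 0.60, 0.35]
    challengers = {
        "A": [0.35, 0.50, 0.10, 0.65, 0.40],  # gains on 4 topics, one large loss
        "B": [0.32, 0.47, 0.22, 0.62, 0.37],  # small gain on every topic
    }
    for name, (mean, lo, hi) in compare_challengers(champion, challengers).items():
        print(f"{name}: mean={mean:+.3f}  95% interval=({lo:+.3f}, {hi:+.3f})")
```

With alpha = 1, challenger A's single loss of 0.10 counts as 0.20 and cancels its four gains of 0.05, so its risk-adjusted mean is about zero, while B's uniform gain of 0.02 survives intact. Doubling the weight on losses is what lets a risk-averse comparison prefer B despite A's larger raw gains on most topics.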
Pages: 339-348
Page count: 10