Statistical Significance Testing in Theory and in Practice

Author
Carterette, Ben [1]
Affiliation
[1] Spotify, Stockholm, Sweden
Funding
U.S. National Science Foundation
Keywords
DOI
10.1145/3341981.3358959
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The past 25 years have seen a great improvement in the rigor of experimentation on information access problems. This is due primarily to three factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtrieval Conference [39]), the increased ease of online A/B testing on large user populations, and the increased practice of statistical hypothesis testing to determine whether observed improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work on information access (IA) problems such as search and recommendation increasingly cannot be published unless it has been evaluated offline using a well-constructed test collection or online on a large user base and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a "black box": evaluation results go in and a p-value comes out. But because significance is such an important factor in determining what directions to explore and what is published or deployed, using p-values obtained without thought can have consequences for everyone working in IA. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false [21]; could that be the case for IA as well?
Pages: 256-258
Page count: 3