Comparing Test Sets with Item Response Theory

Authors
Vania, Clara [1, 2]
Htut, Phu Mon [2]
Huang, William [2]
Mungra, Dhara [2]
Pang, Richard Yuanzhe [2]
Phang, Jason [2]
Liu, Haokun [3]
Cho, Kyunghyun [2]
Bowman, Samuel R. [2]
Affiliations
[1] Amazon, Seattle, WA 98108 USA
[2] NYU, New York, NY 10003 USA
[3] Allen Inst AI, Seattle, WA USA
Funding
National Science Foundation (USA)
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
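As a rough illustration of why Item Response Theory can separate saturated test examples from discriminating ones, here is a minimal sketch of the standard two-parameter logistic (2PL) formulation. It is a textbook form and not necessarily the variant the authors fit. The function names, the item parameters, and the ability values are all hypothetical and chosen only for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability that a responder with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability `theta`.
    It peaks at theta == b and scales with a squared."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def mean_information(item, abilities):
    """Average information an item provides over a group of responders."""
    total = sum(item_information(t, item["a"], item["b"]) for t in abilities)
    return total / len(abilities)


# Hypothetical items (assumed values, not taken from the paper).
easy_item = {"a": 1.0, "b": -3.0}  # nearly every strong model gets it right
hard_item = {"a": 2.0, "b": 2.0}   # difficulty sits near the strong models

# Hypothetical ability estimates for a group of strong models.
strong_models = [1.5, 2.0, 2.5]

easy_info = mean_information(easy_item, strong_models)
hard_info = mean_information(hard_item, strong_models)
print(f"easy item: {easy_info:.4f}, hard item: {hard_info:.4f}")
```

Under these assumed values the easy item carries almost no information about the strong models, because they all answer it correctly with probability close to 1. That is the sense in which a test set can be called saturated.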
Pages: 1141-1158
Page count: 18