Comparing Test Sets with Item Response Theory

Authors
Vania, Clara [1, 2]
Htut, Phu Mon [2]
Huang, William [2]
Mungra, Dhara [2]
Pang, Richard Yuanzhe [2]
Phang, Jason [2]
Liu, Haokun [3]
Cho, Kyunghyun [2]
Bowman, Samuel R. [2]
Affiliations
[1] Amazon, Seattle, WA 98108 USA
[2] NYU, New York, NY 10003 USA
[3] Allen Inst AI, Seattle, WA USA
Funding
National Science Foundation (USA)
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
Pages: 1141-1158
Page count: 18