A resampling-based method to evaluate NLI models

Cited by: 0
|
Authors
Salvatore, Felipe de Souza [1]
Finger, Marcelo [1]
Hirata Jr, Roberto [1]
Patriota, Alexandre G. [1]
Affiliation
[1] Univ Sao Paulo, Inst Matemat & Estat, Sao Paulo, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
Textual entailment; Text classification; Statistical methods; Machine learning; Evaluation;
DOI
10.1017/S1351324923000268
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The recent progress of deep learning techniques has produced models capable of achieving high scores on traditional Natural Language Inference (NLI) datasets. To understand the generalization limits of these powerful models, an increasing number of adversarial evaluation schemes have appeared. These works share a similar evaluation method: they construct a new NLI test set from sentences with known logical and semantic properties (the adversarial set), train a model on a benchmark NLI dataset, and evaluate it on the new set. Poor performance on the adversarial set is identified as a model limitation. The problem with this evaluation procedure is that it may only indicate a sampling problem: a machine learning model can perform poorly on a new test set because the text patterns present in the adversarial set are not well represented in the training sample. To address this problem, we present a new evaluation method, the Invariance under Equivalence test (IE test). The IE test trains a model with sufficient adversarial examples and checks the model's performance on two equivalent datasets. As a case study, we apply the IE test to state-of-the-art NLI models, using synonym substitution as the form of adversarial examples. The experiment shows that, despite their high predictive power, these models usually produce different inference outputs for equivalent inputs and, more importantly, that this deficiency cannot be solved by adding adversarial observations to the training data.
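The invariance check at the heart of the IE test can be sketched as follows: apply a meaning-preserving transformation (here, synonym substitution) to each premise-hypothesis pair and measure how often the model's predicted label changes. Everything below is an illustrative toy under stated assumptions (the word-overlap classifier, the hand-made synonym table, and the `invariance_rate` helper are all hypothetical, not the paper's implementation):

```python
# Toy sketch of the Invariance under Equivalence (IE) test idea:
# a model is "invariant" on a pair if its label is unchanged when
# premise and hypothesis are rewritten with synonyms.

# Hypothetical synonym table used to build the "equivalent" dataset.
SYNONYMS = {"movie": "film", "couch": "sofa", "happy": "glad"}

def substitute(sentence):
    """Replace known words with synonyms, producing an equivalent sentence."""
    return " ".join(SYNONYMS.get(w, w) for w in sentence.split())

def toy_model(premise, hypothesis):
    """Stand-in NLI classifier: predicts 'entailment' when every hypothesis
    word appears in the premise, else 'neutral' (a word-overlap heuristic)."""
    if set(hypothesis.split()) <= set(premise.split()):
        return "entailment"
    return "neutral"

def invariance_rate(model, pairs):
    """Fraction of pairs whose predicted label survives synonym substitution."""
    agree = sum(
        model(p, h) == model(substitute(p), substitute(h))
        for p, h in pairs
    )
    return agree / len(pairs)

pairs = [
    ("a man watches a movie", "a man watches a movie"),
    ("she sat on the couch", "she sat on the sofa"),
]
print(f"invariance rate: {invariance_rate(toy_model, pairs):.2f}")
```

On the second pair the heuristic flips from "neutral" to "entailment" once "couch" is rewritten as "sofa", mirroring the failure mode the paper reports: equivalent inputs receiving different inference outputs.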
Pages: 793-820
Page count: 28