Subgroup mining for performance analysis of regression models

Cited by: 3
Authors
Pimentel, Joao [1]
Azevedo, Paulo J. [1, 2]
Torgo, Luis [1, 3]
Affiliations
[1] Univ Minho, Dept Informat, Braga, Portugal
[2] INESC TEC, P-4050190 Porto, Portugal
[3] Dalhousie Univ, Fac Comp Sci, Halifax, NS, Canada
Keywords
interpretability; machine learning; performance; regression
DOI
10.1111/exsy.13118
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Machine learning algorithms have shown several advantages over humans, namely in the scale of data that can be analysed and in the speed and precision they deliver. However, it is not always possible to understand how these algorithms work. Because of the complexity of some algorithms, users have started to ask for explanations, boosting the relevance of Explainable Artificial Intelligence. This field aims to explain and interpret models using specific analytical methods that usually analyse how their predicted values and/or errors behave. While prediction analysis is widely studied, performance analysis remains limited for regression models. This paper proposes a rule-based approach, Error Distribution Rules (EDRs), to uncover atypical error regions while considering multivariate feature interactions without size restrictions. Extracting EDRs is a form of subgroup mining. EDRs are a model-agnostic drill-down technique for evaluating regression models that considers multivariate interactions between predictors. EDRs uncover regions of the input space with deviating performance and provide an interpretable description of these regions. They can be regarded as a complementary tool to the standard reporting of the expected average predictive performance. Moreover, by providing interpretable descriptions of these specific regions, EDRs allow end users to understand the dangers of using regression tools for cases that fall in these regions; that is, they improve the accountability of models. The performance of several models on different problems was studied, showing that the proposal allows the analysis of many situations and direct model comparison. To facilitate the examination of rules, two visualization tools based on boxplots and density plots were implemented. A network visualization tool is also provided to rapidly check the interactions of every feature condition. An additional tool uses a grid of boxplots in which the quartiles of every distribution are compared with a reference. Based on this comparison, an extrapolation of counterfactual examples to regression was also implemented. A set of examples is described, including a setting where the performance of regression models is compared in detail using EDRs. Specifically, the error difference between two models on a dataset is studied by deriving rules that highlight regions of the input space where the difference in model performance is unexpected. The application of the visual tools is illustrated using example EDRs derived from publicly available datasets. Case studies are also presented that illustrate the specialization of subgroups, the identification of counterfactual subgroups, and the detection of unanticipated complex models. This paper extends the state of the art by providing a method to derive explanations for model performance instead of explanations for model predictions.
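The general idea of looking for feature-defined subgroups whose error distribution deviates from the overall one can be illustrated with a minimal sketch. This is not the authors' EDR algorithm: the function name `mine_error_subgroups`, the median-gap criterion, the `min_support`, `min_gap` and `max_conditions` parameters, and the toy data are all assumptions made for illustration, and the sketch caps rule length whereas the abstract describes rules without size restrictions.

```python
from itertools import combinations
from statistics import median


def mine_error_subgroups(rows, errors, min_support=3, min_gap=3.0,
                         max_conditions=2):
    """Hypothetical helper: list rules whose error median deviates.

    rows   -- list of dicts mapping feature name -> discrete value
    errors -- per-row absolute errors of some already-fitted regressor
    Returns (rule, support, subgroup_median, reference_median) tuples,
    where a rule is a tuple of (feature, value) conditions joined by AND.
    """
    reference = median(errors)
    # Every single (feature, value) condition observed in the data.
    items = sorted({(f, v) for row in rows for f, v in row.items()})
    found = []
    for size in range(1, max_conditions + 1):
        for rule in combinations(items, size):
            # Skip rules that constrain the same feature twice.
            if len({f for f, _ in rule}) < size:
                continue
            covered = [e for row, e in zip(rows, errors)
                       if all(row.get(f) == v for f, v in rule)]
            if len(covered) < min_support:
                continue
            sub_median = median(covered)
            # Assumed deviation measure: gap between medians.
            if abs(sub_median - reference) >= min_gap:
                found.append((rule, len(covered), sub_median, reference))
    # Largest deviation first.
    found.sort(key=lambda r: abs(r[2] - r[3]), reverse=True)
    return found


# Toy data (made up): errors are large only for north AND large.
rows = (
    [{"region": "north", "size": "large"}] * 3
    + [{"region": "north", "size": "small"}] * 3
    + [{"region": "south", "size": "large"}] * 3
    + [{"region": "south", "size": "small"}] * 3
)
errors = [9, 10, 11, 1, 2, 1, 2, 1, 2, 1, 2, 1]

results = mine_error_subgroups(rows, errors)
for rule, support, sub_median, reference in results:
    text = " AND ".join(f"{f}={v}" for f, v in rule)
    print(f"{text}: n={support}, median error {sub_median} vs {reference}")
```

On this toy data the two-condition rule ranks first, which mirrors the drill-down idea: a single condition such as region=north looks only moderately worse than average, while the interaction with size=large isolates the region where the errors actually concentrate. A fuller treatment would compare whole distributions rather than medians alone.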
Pages: 20