Extended a Priori Probability (EAPP): A Data-Driven Approach for Machine Learning Binary Classification Tasks

Authors
Ortiz Castello, Vicent [1]
Perez-Benito, Francisco Javier [1]
Del Tejo Catala, Omar [1]
Salvador Igual, Ismael [1]
Llobet, Rafael [1, 2]
Perez-Cortes, Juan-Carlos [1, 3]
Affiliations
[1] Univ Politecn Valencia, Inst Tecnol Informat ITI, Valencia 46022, Spain
[2] Univ Politecn Valencia, Dept Comp Syst & Computat DSIC, Valencia 46022, Spain
[3] Univ Politecn Valencia, Dept Comp Engn DISCA, E-46022 Valencia, Spain
Keywords
A priori probability; EAPP; clustering; autoencoder; semisupervised; combinatory; bias
DOI
10.1109/ACCESS.2022.3221936
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The a priori probability of a dataset is usually used as a baseline for comparing a particular algorithm's accuracy in a given binary classification task. ZeroR is the simplest algorithm for this, predicting the majority class for all examples. However, this is an extremely simple approach that has no predictive power and does not describe other dataset features that could lead to a more demanding baseline. In this paper, we present the Extended A Priori Probability (EAPP), a novel semi-supervised baseline metric for binary classification tasks that considers not only the a priori probability but also possible bias present in the dataset, as well as other features that could make the target classes relatively trivial to separate. The approach is based on the area under the ROC curve (AUC ROC), known to be quite insensitive to class imbalance. The procedure involves multiobjective feature extraction and a clustering stage in the input space with autoencoders, followed by a combinatory weighted assignment of clusters to classes that depends on the distance to the nearest clusters of each class. Class labels are then assigned to establish the combination that maximizes AUC ROC for each number of clusters considered. To avoid overfitting in the combined feature extraction and clustering method, a cross-validation scheme is performed in each case. EAPP is defined for different numbers of clusters, starting from the inverse of the minority class proportion, which is useful for a fair comparison among diversely imbalanced datasets. A high EAPP usually relates to an easy binary classification task, but when the task is known beforehand to be difficult, it may instead be due to a significant coarse-grained bias in the dataset. This metric represents a baseline beyond the a priori probability to assess the actual capabilities of binary classification models.
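The contrast the abstract draws between the ZeroR baseline and a cluster-based baseline can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes cluster memberships are already given (the paper derives them with autoencoders), uses a hard 0/1 assignment of clusters to classes instead of the distance-weighted one, and omits cross-validation. All function names and the toy data are hypothetical.

```python
from itertools import product


def zero_r_accuracy(labels):
    """Accuracy of always predicting the majority class (the a priori baseline)."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def auc_roc(labels, scores):
    """AUC ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_cluster_assignment(labels, cluster_ids):
    """Try every hard 0/1 labelling of the clusters; keep the highest-AUC one.

    Assumption: hard assignment only. The paper describes a weighted,
    distance-dependent assignment, which this sketch does not reproduce.
    """
    clusters = sorted(set(cluster_ids))
    best_auc, best_map = 0.0, None
    for combo in product([0, 1], repeat=len(clusters)):
        mapping = dict(zip(clusters, combo))
        scores = [mapping[c] for c in cluster_ids]
        auc = auc_roc(labels, scores)
        if auc > best_auc:
            best_auc, best_map = auc, mapping
    return best_auc, best_map


# Toy data (made up): 8 negatives, 2 positives, three hypothetical clusters.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
cluster_ids = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]

baseline = zero_r_accuracy(labels)
auc, mapping = best_cluster_assignment(labels, cluster_ids)
print(baseline, auc, mapping)
```

On this toy data the majority-class accuracy is 0.8, while labelling only cluster 2 as positive gives an AUC ROC of 0.875. The exhaustive search grows as 2^k in the number of clusters k, so it is only practical for small k.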
Pages: 120074-120085
Page count: 12