Optimal selection of benchmarking datasets for unbiased machine learning algorithm evaluation

Cited by: 0

Authors:

Pereira, Joao Luiz Junho [1]

Smith-Miles, Kate [2]

Munoz, Mario Andres [3]

Lorena, Ana Carolina [1]

Affiliations:

[1] Inst Tecnol Aeronaut, Sao Jose Dos Campos, Brazil

[2] Univ Melbourne, Sch Math & Stat, Melbourne, Australia

[3] Univ Melbourne, Sch Comp & Informat Syst, Melbourne, Australia

Source:

DATA MINING AND KNOWLEDGE DISCOVERY | 2024 / Volume 38 / Issue 02

Funding:

Australian Research Council; Sao Paulo Research Foundation (Brazil);

Keywords:

Benchmark datasets' suites; Instance space analysis; Classification algorithms; Regression algorithms; Meta-learning; Optimization; UCI PLUS; CLASSIFIERS; REPOSITORY;

DOI:

10.1007/s10618-023-00957-1

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Whenever a new supervised machine learning (ML) algorithm or solution is developed, it is imperative to evaluate the predictive performance it attains on diverse datasets. This is done to stress test the strengths and weaknesses of the novel algorithm and to provide evidence for the situations in which it is most useful. A common practice is to gather some datasets from public benchmark repositories for such an evaluation. However, little or no specific criteria are used in the selection of these datasets, which is often ad hoc. This paper investigates the importance of gathering a diverse benchmark of datasets in order to properly evaluate ML models and understand their capabilities. Drawing on meta-learning studies that evaluate the diversity of public dataset repositories, this paper introduces an optimization method to choose varied classification and regression datasets from a pool of candidate datasets. The method is based on maximum coverage, circular packing, and the meta-heuristic Lichtenberg Algorithm, ensuring that the chosen datasets are diverse and able to challenge the ML algorithms more broadly. The selections were compared experimentally with a random selection of datasets and with clustering by k-medoids, and proved to be more effective regarding the diversity of the chosen benchmarks and the ability to challenge the ML algorithms at different levels.
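
The abstract gives no implementation details, so the following Python sketch only illustrates the general maximum-coverage idea it mentions. It is not the authors' method: a simple greedy heuristic stands in for the Lichtenberg Algorithm, circular packing is not modelled, and the 2D coordinates, the `radius` value, and the function names `coverage` and `greedy_max_coverage` are all hypothetical. Each candidate dataset is treated as a point in a 2D space, and a selected dataset is assumed to "cover" every candidate within `radius` of it.

```python
import math
import random


def coverage(points, chosen, radius):
    """Fraction of candidate points within `radius` of at least one chosen point."""
    covered = sum(
        1 for p in points
        if any(math.dist(p, points[c]) <= radius for c in chosen)
    )
    return covered / len(points)


def greedy_max_coverage(points, k, radius):
    """Greedy stand-in: pick, k times, the point covering the most uncovered points."""
    uncovered = set(range(len(points)))
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1
        for i in range(len(points)):
            if i in chosen:
                continue
            gain = sum(
                1 for j in uncovered
                if math.dist(points[i], points[j]) <= radius
            )
            if gain > best_gain:  # ties keep the lowest index
                best, best_gain = i, gain
        chosen.append(best)
        uncovered -= {
            j for j in uncovered
            if math.dist(points[best], points[j]) <= radius
        }
    return chosen


# Toy candidate pool (made-up coordinates): a dense group, a pair, and an outlier.
toy = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (10, 10)]

picked = greedy_max_coverage(toy, k=2, radius=0.5)
print(picked)                                   # [0, 3]
print(f"{coverage(toy, picked, 0.5):.2f}")      # 0.83

# Random baseline for comparison; its result depends on the seed.
baseline = random.Random(0).sample(range(len(toy)), 2)
print(baseline, f"{coverage(toy, baseline, 0.5):.2f}")
```

On this toy pool the greedy pick takes one representative from the dense group and one from the pair, whereas a random draw can easily spend both picks inside the dense group. Greedy selection is only an approximation to maximum coverage, so it should be read as a baseline illustration rather than a substitute for the optimization described in the paper.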


Pages: 461-500

Page count: 40
