Machine Learning Metrics for Network Datasets Evaluation

Cited by: 0

Authors:

Soukup, Dominik ^{[1]}

Uhricek, Daniel ^{[2]}

Vasata, Daniel ^{[1]}

Cejka, Tomas ^{[3]}

Affiliations:

[1] Czech Tech Univ, Fac Informat Technol, Prague, Czech Republic

[2] Brno Univ Technol, Fac Informat Technol, Brno, Czech Republic

[3] CESNET Ale, Prague, Czech Republic

Source:

ICT SYSTEMS SECURITY AND PRIVACY PROTECTION, IFIP SEC 2023 | 2024, Vol. 679

DOI:

10.1007/978-3-031-56326-3_22

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

High-quality datasets are an essential requirement for leveraging machine learning (ML) in data processing and, more recently, in network security. However, dataset quality is often overlooked or underestimated. Reliable metrics that measure and describe an input dataset make it possible to assess whether the dataset is feasible to use. Imperfect datasets may require optimization or updating, e.g., by including more data or merging class labels. Applying ML algorithms will not bring practical value if a dataset does not contain enough information. This work addresses the neglected topics of dataset evaluation and missing metrics. We propose three novel metrics that estimate the quality of an input dataset and help with improving it or building a new one. This paper describes experiments performed on public datasets to show the benefits of the proposed metrics, and gives theoretical definitions to make the metrics easier to interpret. Additionally, we have implemented and published Python code so that the metrics can be adopted by the worldwide scientific community.

Pages: 307-320

Number of pages: 14
