Machine Learning Metrics for Network Datasets Evaluation

Authors
Soukup, Dominik [1]
Uhricek, Daniel [2]
Vasata, Daniel [1]
Cejka, Tomas [3]
Affiliations
[1] Czech Tech Univ, Fac Informat Technol, Prague, Czech Republic
[2] Brno Univ Technol, Fac Informat Technol, Brno, Czech Republic
[3] CESNET Ale, Prague, Czech Republic
DOI
10.1007/978-3-031-56326-3_22
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
High-quality datasets are an essential requirement for leveraging machine learning (ML) in data processing and, more recently, in network security. However, dataset quality is very often overlooked or underestimated. Reliable metrics that measure and describe an input dataset make it possible to assess the feasibility of using that dataset. Imperfect datasets may require optimization or updating, e.g., by including more data and merging class labels. Applying ML algorithms will not bring practical value if a dataset does not contain enough information. This work addresses the neglected topics of dataset evaluation and missing metrics. We propose three novel metrics that estimate the quality of an input dataset and help with improving it or building a new one. The paper presents experiments on public datasets that show the benefits of the proposed metrics, together with theoretical definitions that make them more straightforward to interpret. Additionally, we have implemented and published Python code so that the scientific community can adopt the metrics.
Pages: 307-320
Page count: 14