Machine Learning Metrics for Network Datasets Evaluation

Cited by: 0
Authors
Soukup, Dominik [1]
Uhricek, Daniel [2]
Vasata, Daniel [1]
Cejka, Tomas [3]
Affiliations
[1] Czech Tech Univ, Fac Informat Technol, Prague, Czech Republic
[2] Brno Univ Technol, Fac Informat Technol, Brno, Czech Republic
[3] CESNET a.l.e., Prague, Czech Republic
DOI
10.1007/978-3-031-56326-3_22
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
High-quality datasets are an essential requirement for leveraging machine learning (ML) in data processing and, more recently, in network security. However, the quality of datasets is very often overlooked or underestimated. Reliable metrics that measure and describe an input dataset make it possible to assess the feasibility of that dataset. Imperfect datasets may require optimization or updating, e.g., by including more data and merging class labels. Applying ML algorithms will not bring practical value if a dataset does not contain enough information. This work addresses the neglected topics of dataset evaluation and missing metrics. We propose three novel metrics to estimate the quality of an input dataset and to help with improving it or building a new one. This paper describes experiments performed on public datasets to show the benefits of the proposed metrics, and it provides theoretical definitions for more straightforward interpretation. Additionally, we have implemented and published Python code so that the metrics can be adopted by the worldwide scientific community.
Pages: 307-320
Number of pages: 14
Related papers
  • [1] A review of machine transliteration, translation, evaluation metrics and datasets in Indian Languages
    Jha, Abhinav
    Patil, Hemprasad Yashwant
    MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (15) : 23509 - 23540
  • [2] A review of machine transliteration, translation, evaluation metrics and datasets in Indian Languages
    Abhinav Jha
    Hemprasad Yashwant Patil
    Multimedia Tools and Applications, 2023, 82 : 23509 - 23540
  • [3] Evaluation metrics and statistical tests for machine learning
    Rainio, Oona
    Teuho, Jarmo
    Klen, Riku
    SCIENTIFIC REPORTS, 2024, 14 (01)
  • [4] Machine learning using synthetic and real data: Similarity of evaluation metrics for different healthcare datasets and for different algorithms
    Heyburn, Rachel
    Bond, Raymond R.
    Black, Michaela
    Mulvenna, Maurice
    Wallace, Jonathan
    Rankin, Deborah
    Cleland, Brian
    DATA SCIENCE AND KNOWLEDGE ENGINEERING FOR SENSING DECISION SUPPORT, 2018, 11 : 1281 - 1291
  • [5] GPU-based similarity metrics computation and machine learning approaches for string similarity evaluation in large datasets
    Aurel Baloi
    Bogdan Belean
    Flaviu Turcu
    Daniel Peptenatu
    Soft Computing, 2024, 28 : 3465 - 3477
  • [6] GPU-based similarity metrics computation and machine learning approaches for string similarity evaluation in large datasets
    Baloi, Aurel
    Belean, Bogdan
    Turcu, Flaviu
    Peptenatu, Daniel
    SOFT COMPUTING, 2024, 28 (04) : 3465 - 3477
  • [7] A Survey on Machine Reading Comprehension-Tasks, Evaluation Metrics and Benchmark Datasets
    Zeng, Changchang
    Li, Shaobo
    Li, Qin
    Hu, Jie
    Hu, Jianjun
    APPLIED SCIENCES-BASEL, 2020, 10 (21) : 1 - 57
  • [8] Investigating Network Intrusion Detection Datasets Using Machine Learning
    Amaizu, Gabriel Chukwunonso
    Nwakanma, Cosmas Ifeanyi
    Lee, Jae-Min
    Kim, Dong-Seong
    11TH INTERNATIONAL CONFERENCE ON ICT CONVERGENCE: DATA, NETWORK, AND AI IN THE AGE OF UNTACT (ICTC 2020), 2020 : 1325 - 1328
  • [9] Evaluation of Robustness Metrics for Defense of Machine Learning Systems
    DeMarchi, J.
    Rijken, R.
    Melrose, J.
    Madahar, B.
    Fumera, G.
    Roli, F.
    Ledda, E.
    Aktas, M.
    Kurth, F.
    Baggenstoss, P.
    Pelzer, B.
    Kanestad, L.
    2023 INTERNATIONAL CONFERENCE ON MILITARY COMMUNICATIONS AND INFORMATION SYSTEMS, ICMCIS, 2023
  • [10] Video Description: Datasets & Evaluation Metrics
    Rafiq, Muhammad
    Rafiq, Ghazala
    Choi, Gyu Sang
    IEEE ACCESS, 2021, 9 : 121665 - 121685