Data Curation and Quality Evaluation for Machine Learning-Based Cyber Intrusion Detection

Cited by: 4
Authors
Tran, Ngan [1]
Chen, Haihua [1]
Bhuyan, Jay [2]
Ding, Junhua [1]
Affiliations
[1] Dept Informat Sci, Denton, TX 76203 USA
[2] Tuskegee Univ, Dept Comp Sci, Tuskegee, AL 36088 USA
Funding
U.S. National Science Foundation
Keywords
Data curation; data quality; intrusion detection; machine learning; deep learning; language model; DETECTION SYSTEM; SCHEME;
DOI
10.1109/ACCESS.2022.3211313
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Intrusion detection is an essential task for protecting the cyber environment from attacks. Many studies have proposed sophisticated models to detect intrusions from a large amount of data, yet they ignored the fact that poor data quality has a direct impact on the performance of intrusion detection systems. Examples of poor data quality include mislabeled, inaccurate, incomprehensive, irrelevant, inconsistent, duplicated, and overlapped data. To investigate how data quality may affect machine learning performance, we conducted a series of experiments on 11 host-based intrusion datasets using eight machine learning (ML) models and two pre-trained language models, BERT and GPT-2. The experimental results showed:
1. BERT and GPT-2 outperformed the other models on every dataset.
2. Data duplications and overlaps in a dataset had different performance impacts on the pre-trained models and the classic ML models. The pre-trained models were less susceptible to duplicated and overlapped data than the classic ML models.
3. Removing overlaps and duplicates from training data with a normal range of sequence similarities could improve the pre-trained models' performance on most datasets. However, it may have adverse effects on model performance in datasets with highly similar sequences.
4. The reliability of model evaluation could be affected when testing data contains duplicates.
5. The overlap rate between the normal class and the intrusion class seemed to have an inverse relationship to the performance of the pre-trained models in intrusion detection.
Given the results, we proposed a framework for model selection and data quality assurance for building a high-quality machine learning-based intrusion detection system.
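The duplicate and overlap checks mentioned in the abstract can be illustrated with a minimal Python sketch. The function names and the exact definitions used here (duplicate rate as the share of repeated sequences within one class, overlap rate as the share of unique sequences found in both classes) are assumptions made for illustration; the abstract does not state the formulas the authors used, and the toy sequences are hypothetical.

```python
def duplicate_rate(sequences):
    """Share of sequences that repeat another sequence in the same list.

    Assumed definition: 1 - (unique sequences / total sequences).
    """
    if not sequences:
        return 0.0
    unique = set(map(tuple, sequences))
    return 1.0 - len(unique) / len(sequences)


def overlap_rate(normal, intrusion):
    """Share of unique sequences that occur in both classes.

    Assumed definition: |normal AND intrusion| / |normal OR intrusion|.
    """
    n = set(map(tuple, normal))
    a = set(map(tuple, intrusion))
    union = n | a
    if not union:
        return 0.0
    return len(n & a) / len(union)


def remove_duplicates_and_overlaps(normal, intrusion):
    """Drop within-class duplicates, then drop sequences seen in both classes."""
    # dict.fromkeys keeps the first occurrence and preserves order.
    n = list(dict.fromkeys(map(tuple, normal)))
    a = list(dict.fromkeys(map(tuple, intrusion)))
    shared = set(n) & set(a)
    return (
        [s for s in n if s not in shared],
        [s for s in a if s not in shared],
    )


# Hypothetical system-call ID sequences, not taken from any real dataset.
normal = [[1, 2, 3], [1, 2, 3], [4, 5], [6, 7]]
intrusion = [[4, 5], [8, 9], [8, 9]]

clean_normal, clean_intrusion = remove_duplicates_and_overlaps(normal, intrusion)
```

In line with finding 3, such cleaning would typically be applied to training data only after checking how similar the sequences are, since the abstract reports adverse effects on datasets with highly similar sequences.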
Pages: 121900-121923
Number of pages: 24