STIOCS: Active learning-based semi-supervised training framework for IOC extraction

Cited by: 2
Authors
Tang, Binhui [1, 3]
Li, Xiaohui [1]
Wang, Junfeng [2]
Ge, Wenhan [2]
Yu, Zhongkun [2]
Lin, Tongcan [2]
Affiliations
[1] Sichuan Univ, Sch Cyber Sci & Engn, Chengdu 610065, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[3] Cheng Du Jincheng Coll, Chengdu 610065, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cyber Threat Intelligence (CTI); IOC extraction; Semi-supervised learning; Self-training; Active Learning; DBSCAN; Fusion model;
DOI
10.1016/j.compeleceng.2023.108981
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Cyber Threat Intelligence (CTI) contains numerous Indicators of Compromise (IOCs) and contextual information that are crucial for understanding threat actors' behavior and intentions. However, current information extraction relies predominantly on supervised learning algorithms, which is problematic for CTI for two reasons. First, the scarcity of data labeled with IOCs hampers the effectiveness of supervised learning. Second, existing methods struggle to extract comprehensive contextual features, which makes IOC recognition in CTI text difficult. To address these limitations and better suit the characteristics of CTI text, this paper introduces STIOCS, a semi-supervised framework that combines active learning and self-training for IOC extraction. STIOCS improves the accuracy and efficiency of IOC extraction by leveraging limited labeled data together with a rich unannotated corpus. First, the Active Learning (AL) component uses the Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to select reliable samples, which reduces the noise that pseudo-labeling introduces during self-training. The extraction model integrates a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to extract local and sequential features from CTI text, respectively. Convolutional kernels of different sizes then fuse the two types of features to enhance the semantic features. Finally, a Conditional Random Fields (CRF) layer recognizes IOC entities. The experimental results demonstrate the effectiveness and robustness of the proposed method for IOC extraction, even with limited labeled data. With only approximately 40% of the dataset labeled, the proposed method achieves higher F1 scores than existing supervised methods, and its performance improves consistently as the dataset size increases.
STIOCS effectively suppresses weak label noise, reduces training costs, and enhances the recognition model's performance. It provides a cost-effective training framework for entity extraction in cyber threat intelligence.
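The sample-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each pseudo-labeled sample is represented by a feature vector, clusters those vectors with a small pure-Python DBSCAN, and treats points that DBSCAN marks as noise as unreliable. The names `dbscan` and `select_reliable`, the parameters `eps` and `min_samples`, and the toy data are all hypothetical.

```python
from math import dist

NOISE = -1  # label given to points that belong to no dense region


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (the point itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_samples:  # core point: keep expanding
                queue.extend(nbrs)
        cluster += 1
    return labels


def select_reliable(features, eps=0.5, min_samples=3):
    """Split sample indices into (reliable, unreliable).

    Assumption for illustration: samples in dense regions are kept for
    self-training, and DBSCAN noise points are set aside (for example,
    for manual annotation).
    """
    labels = dbscan(features, eps, min_samples)
    reliable = [i for i, lab in enumerate(labels) if lab != NOISE]
    unreliable = [i for i, lab in enumerate(labels) if lab == NOISE]
    return reliable, unreliable


# Toy 2-D feature vectors: two dense groups and one isolated point.
features = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (10, 10)]
reliable, unreliable = select_reliable(features)
```

In this toy run the isolated point at index 6 is the only one set aside. How the paper builds the feature vectors and what it does with the rejected samples is not stated in the abstract.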
Pages: 16
Related papers
50 in total
  • [31] Combining Committee-Based Semi-Supervised Learning and Active Learning
    Mohamed Farouk Abdel Hady
    Friedhelm Schwenker
    [J]. Journal of Computer Science and Technology, 2010, 25 : 681 - 698
  • [32] Combining Committee-Based Semi-Supervised Learning and Active Learning
    Mohamed Farouk Abdel Hady
    Friedhelm Schwenker
    [J]. Journal of Computer Science & Technology, 2010, 25 (04) : 681 - 698
  • [34] Interactive Cell Segmentation Based on Active and Semi-Supervised Learning
    Su, Hang
    Yin, Zhaozheng
    Huh, Seungil
    Kanade, Takeo
    Zhu, Jun
    [J]. IEEE TRANSACTIONS ON MEDICAL IMAGING, 2016, 35 (03) : 762 - 777
  • [35] Active Learning for Semi-supervised Classification Based On Information Entropy
    Jie, Shen
    Xin, Fan
    Wen, Shen
    [J]. 2009 INTERNATIONAL FORUM ON INFORMATION TECHNOLOGY AND APPLICATIONS, VOL 2, PROCEEDINGS, 2009, : 591 - 595
  • [36] Network Intrusion Detection Based on Active Semi-supervised Learning
    Zhang, Yong
    Niu, Jie
    He, Guojian
    Zhu, Lin
    Guo, Da
    [J]. 51ST ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS (DSN-W 2021), 2021, : 129 - 135
  • [37] Protein Function Prediction Based on Active Semi-supervised Learning
    WANG Xuesong
    CHENG Yuhu
    LI Lijing
    [J]. Chinese Journal of Electronics, 2016, 25 (04) : 595 - 600
  • [38] Protein Function Prediction Based on Active Semi-supervised Learning
    Wang Xuesong
    Cheng Yuhu
    Li Lijing
    [J]. CHINESE JOURNAL OF ELECTRONICS, 2016, 25 (04) : 595 - 600
  • [39] Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme
    Fazakis, Nikos
    Kanas, Vasileios G.
    Aridas, Christos K.
    Karlos, Stamatis
    Kotsiantis, Sotiris
    [J]. ENTROPY, 2019, 21 (10)
  • [40] Semi-supervised Learning Framework for UAV Detection
    Medaiyese, Olusiji O.
    Ezuma, Martins
    Lauf, Adrian P.
    Guvenc, Ismail
    [J]. 2021 IEEE 32ND ANNUAL INTERNATIONAL SYMPOSIUM ON PERSONAL, INDOOR AND MOBILE RADIO COMMUNICATIONS (PIMRC), 2021,