STIOCS: Active learning-based semi-supervised training framework for IOC extraction

Cited by: 2
Authors
Tang, Binhui [1, 3]
Li, Xiaohui [1]
Wang, Junfeng [2]
Ge, Wenhan [2]
Yu, Zhongkun [2]
Lin, Tongcan [2]
Affiliations
[1] Sichuan Univ, Sch Cyber Sci & Engn, Chengdu 610065, Peoples R China
[2] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China
[3] Cheng Du Jincheng Coll, Chengdu 610065, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cyber Threat Intelligence (CTI); IOC extraction; Semi-supervised learning; Self-training; Active Learning; DBSCAN; Fusion model;
DOI
10.1016/j.compeleceng.2023.108981
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Cyber Threat Intelligence (CTI) contains numerous Indicators of Compromise (IOCs) and contextual information, both crucial for understanding threat actors' behavior and intentions. However, current information extraction relies predominantly on supervised learning algorithms, which presents challenges in the field of CTI for two reasons. Firstly, the scarcity of labeled data with IOCs hampers the effectiveness of supervised learning. Secondly, existing methods struggle to extract comprehensive contextual features, which makes IOC recognition within CTI difficult. To address these limitations and better suit the unique characteristics of CTI text, this paper introduces STIOCS, a semi-supervised framework that combines active learning and self-training for IOC extraction. STIOCS enhances IOC extraction accuracy and efficiency by leveraging limited labeled data and a rich unannotated corpus. Firstly, the Active Learning (AL) approach uses the Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to select reliable samples, which reduces the noise that pseudo-labeling introduces in self-training. The extraction model integrates Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) algorithms to extract local and sequential features from CTI text, respectively. Then, the semantic features are enhanced by fusing the two types of features with convolutional kernels of different sizes. Finally, a Conditional Random Fields (CRF) layer is employed to recognize IOC entities. Our experimental results demonstrate the effectiveness and robustness of the proposed method in IOC extraction, even with limited labeled data. With only approximately 40% of the dataset labeled, the proposed method achieves better F1 scores than existing supervised methods and exhibits consistent performance improvements as the dataset size increases.
STIOCS effectively suppresses weak label noise, reduces training costs, and enhances the recognition model's performance. It provides a cost-effective training framework for entity extraction in cyber threat intelligence.
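The sample-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how DBSCAN output is turned into a selection rule, so the sketch assumes that pseudo-labeled samples flagged as noise by DBSCAN are discarded and all clustered samples are kept. The function names (`dbscan`, `select_reliable`), the 2-D toy embeddings, and the `eps` and `min_samples` values are hypothetical and chosen only for illustration.

```python
import math


def dbscan(points, eps, min_samples):
    """Plain DBSCAN. Returns one cluster label per point; -1 marks noise."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


def select_reliable(embeddings, eps=0.5, min_samples=3):
    """Assumed selection rule: keep indices of samples DBSCAN does not mark as noise."""
    labels = dbscan(embeddings, eps, min_samples)
    return [i for i, label in enumerate(labels) if label != -1]


# Toy embeddings of pseudo-labeled samples: four dense points and one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
reliable = select_reliable(embeddings)
print(reliable)  # indices of samples kept for the next self-training round
```

In this toy run the isolated point at (5.0, 5.0) has no neighbors within `eps`, so it is treated as noise and excluded, while the four dense points are retained. In a real pipeline the embeddings would come from the extraction model rather than being hand-written.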
Pages: 16
Related papers
50 in total
  • [1] Dual Learning-Based Safe Semi-Supervised Learning
    Gan, Haitao
    Li, Zhenhua
    Fan, Yingle
    Luo, Zhizeng
    [J]. IEEE ACCESS, 2018, 6 : 2615 - 2621
  • [2] Semi-supervised Clustering Framework Based on Active Learning for Real Data
    Odate, Ryosuke
    Shinjo, Hiroshi
    Suzuki, Yasufumi
    Motobayashi, Masahiro
    [J]. STRUCTURAL, SYNTACTIC, AND STATISTICAL PATTERN RECOGNITION, S+SSPR 2018, 2018, 11004 : 184 - 193
  • [3] A Semi-supervised Active Learning Framework for Image Classification
    Li, Han-yi
    Yang, Ming
    Kang, Nan-nan
    Yue, Lu-lu
    [J]. MECHATRONICS ENGINEERING, COMPUTING AND INFORMATION TECHNOLOGY, 2014, 556-562 : 4765 - 4769
  • [4] A semi-supervised active learning framework for image retrieval
    Hoi, SCH
    Lyu, MR
    [J]. 2005 IEEE COMPUTER SOCIETY CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOL 2, PROCEEDINGS, 2005, : 302 - 309
  • [5] An adaptive semi-supervised deep learning-based framework for the detection of Android malware
    Wajahat, Ahsan
    He, Jingsha
    Zhu, Nafei
    Mahmood, Tariq
    Nazir, Ahsan
    Pathan, Muhammad Salman
    Qureshi, Sirajuddin
    Ullah, Faheem
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (03) : 5141 - 5157
  • [6] SEMI-SUPERVISED CO-TRAINING AND ACTIVE LEARNING FRAMEWORK FOR HYPERSPECTRAL IMAGE CLASSIFICATION
    Samiappan, Sathishkumar
    Moorhead, Robert J., II
    [J]. 2015 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS), 2015, : 401 - 404
  • [7] A semi-supervised learning framework for biomedical event extraction based on hidden topics
    Zhou, Deyu
    Zhong, Dayou
    [J]. ARTIFICIAL INTELLIGENCE IN MEDICINE, 2015, 64 (01) : 51 - 58
  • [8] Semi-supervised learning combining co-training with active learning
    Zhang, Yihao
    Wen, Junhao
    Wang, Xibin
    Jiang, Zhuo
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2014, 41 (05) : 2372 - 2378
  • [9] Soft Semi-Supervised Deep Learning-Based Clustering
    Alzuhair, Mona Suliman
    Ben Ismail, Mohamed Maher
    Bchir, Ouiem
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (17):
  • [10] A Unified Active and Semi-Supervised Learning Framework for Image Compression
    He, Xiaofei
    Ji, Ming
    Bao, Hujun
    [J]. CVPR: 2009 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, VOLS 1-4, 2009, : 65 - 72