STIOCS: Active learning-based semi-supervised training framework for IOC extraction

Cited by: 2

Authors:

Tang, Binhui [1,3]

Li, Xiaohui [1]

Wang, Junfeng [2]

Ge, Wenhan [2]

Yu, Zhongkun [2]

Lin, Tongcan [2]

Affiliations:

[1] Sichuan Univ, Sch Cyber Sci & Engn, Chengdu 610065, Peoples R China

[2] Sichuan Univ, Coll Comp Sci, Chengdu 610065, Peoples R China

[3] Cheng Du Jincheng Coll, Chengdu 610065, Peoples R China

Source:

COMPUTERS & ELECTRICAL ENGINEERING | 2023 / Vol. 112

Funding:

National Natural Science Foundation of China;

Keywords:

Cyber Threat Intelligence (CTI); IOC extraction; Semi-supervised learning; Self-training; Active Learning; DBSCAN; Fusion model;

DOI:

10.1016/j.compeleceng.2023.108981

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Cyber Threat Intelligence (CTI) contains numerous Indicators of Compromise (IOCs) and contextual information, which are crucial for understanding threat actors' behavior and intentions. However, current information extraction relies predominantly on supervised learning algorithms, which presents challenges in the field of CTI for two reasons. Firstly, the scarcity of labeled data with IOCs hampers the effectiveness of supervised learning. Secondly, existing methods struggle to extract comprehensive contextual features, which makes IOC recognition within CTI difficult. To address these limitations and better suit the unique characteristics of CTI text, this paper introduces STIOCS, a semi-supervised framework that combines active learning and self-training for IOC extraction. STIOCS enhances IOC extraction accuracy and efficiency by leveraging limited labeled data and a rich unannotated corpus. Firstly, the Active Learning (AL) approach uses the Density-based Spatial Clustering of Applications with Noise (DBSCAN) algorithm to select reliable samples, which reduces the noise that pseudo-labeling introduces into self-training. The extraction model integrates Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) algorithms to extract local and sequential features from CTI text, respectively. Then, the semantic features are enhanced by using convolutional kernels of different sizes to fuse the two types of features. Finally, a Conditional Random Fields (CRF) layer is employed to recognize IOC entities. The experimental results demonstrate the effectiveness and robustness of the proposed method in IOC extraction, even with limited labeled data. With only approximately 40% of the dataset labeled, the proposed method achieves better F1 scores than existing supervised methods and exhibits consistent performance improvements as the dataset size increases. STIOCS effectively suppresses weak label noise, reduces training costs, and enhances the recognition model's performance. It provides a cost-effective training framework for entity extraction in cyber threat intelligence.
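
The abstract does not spell out how DBSCAN is used to pick reliable samples, so the following is only a minimal sketch of one plausible reading: pseudo-labeled samples are clustered by their feature vectors, samples that fall in dense clusters are kept as "reliable" for self-training, and samples that DBSCAN marks as noise are set aside (for example, for manual annotation). The function names `dbscan` and `split_by_density`, the 2-D toy features, and the values of `eps` and `min_samples` are all assumptions made for illustration and are not taken from the paper.

```python
from math import dist

NOISE = -1  # label DBSCAN assigns to points that belong to no dense region


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Plain DBSCAN; returns one cluster id per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels


def split_by_density(features, eps=0.5, min_samples=3):
    """Hypothetical selection rule: clustered samples are treated as reliable,
    DBSCAN noise points as uncertain."""
    labels = dbscan(features, eps, min_samples)
    reliable = [i for i, label in enumerate(labels) if label != NOISE]
    uncertain = [i for i, label in enumerate(labels) if label == NOISE]
    return reliable, uncertain


# Toy feature vectors standing in for embeddings of pseudo-labeled samples.
features = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
            (10.0, 0.0)]
reliable, uncertain = split_by_density(features, eps=0.5, min_samples=3)
```

In this toy run the two tight groups of three points form clusters and the isolated point is flagged as noise. In a real pipeline the feature vectors would come from the extraction model, and `eps` and `min_samples` would need tuning on the actual data.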


Pages: 16
