Software Vulnerabilities Detection Based on a Pre-trained Language Model

Cited: 0
Authors
Xu, Wenlin [1]
Li, Tong [2]
Wang, Jinsong [3]
Duan, Haibo [3]
Tang, Yahui [4]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
Source
2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023 | 2024
Keywords
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection
DOI
10.1109/TrustCom60117.2023.00129
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Software vulnerability detection is crucial in cyber security, as it protects software systems from malicious attacks. The majority of earlier techniques relied on security professionals to define software features before training a classification or regression model on those features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time-consuming. To address these issues, in this paper we propose an unsupervised and effective method for extracting software features and detecting software vulnerabilities automatically. Firstly, we obtain software features and build a new pre-trained BERT model by constructing a C/C++ vocabulary and pre-training on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder to create low-dimensional embeddings from the software features. Finally, we apply a clustering-based outlier detection method to the embeddings to detect vulnerabilities. We evaluate our method on five datasets of programs written in C/C++; experimental results show that it outperforms state-of-the-art software vulnerability detection methods.
Pages: 904-911
Page count: 8
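
The abstract describes a three-stage pipeline: pre-training BERT on C/C++ source code with a custom vocabulary, fine-tuning with a deep autoencoder to obtain low-dimensional embeddings, and applying clustering-based outlier detection to those embeddings. The Python sketch below only illustrates how such a pipeline could be wired together; the model path, layer sizes, cluster count, and outlier threshold are assumptions for illustration and are not taken from the paper.

# Illustrative sketch only: paths and hyperparameters are assumed, not the authors' code.
import numpy as np
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast
from sklearn.cluster import KMeans

# 1) Embed C/C++ functions with a BERT encoder pre-trained on source code
#    using a custom C/C++ vocabulary (path is a placeholder).
tokenizer = BertTokenizerFast.from_pretrained("path/to/cpp-bert")
encoder = BertModel.from_pretrained("path/to/cpp-bert")

def embed(functions):
    # Return one pooled feature vector per source-code function.
    batch = tokenizer(functions, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state.mean(dim=1)  # mean-pool token states

# 2) Deep autoencoder that compresses the BERT features into a
#    low-dimensional embedding; trained with a reconstruction loss
#    (training loop omitted for brevity).
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=768, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

# 3) Clustering-based outlier detection: functions whose latent embedding
#    lies far from every cluster centre are flagged as suspicious.
def detect_outliers(latent, n_clusters=8, quantile=0.95):
    latent = np.asarray(latent)                 # accept torch tensors or arrays
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(latent)
    dist = km.transform(latent).min(axis=1)     # distance to nearest centre
    return dist > np.quantile(dist, quantile)   # True = suspected vulnerability

# Usage sketch: features = embed(list_of_functions)
#               _, z = AutoEncoder()(features)    # after training the autoencoder
#               flags = detect_outliers(z.detach())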