Software Vulnerabilities Detection Based on a Pre-trained Language Model

Cited: 0
Authors
Xu, Wenlin [1 ]
Li, Tong [2 ]
Wang, Jinsong [3 ]
Duan, Haibo [3 ]
Tang, Yahui [4 ]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
Source
2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023 | 2024
Keywords
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection;
DOI
10.1109/TrustCom60117.2023.00129
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software vulnerability detection is crucial in cyber security, as it protects software systems from malicious attacks. The majority of earlier techniques relied on security professionals to define software features before training a classification or regression model on those features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time-consuming. To address these issues, in this paper we propose an unsupervised and effective method for automatically extracting software features and detecting software vulnerabilities. First, we obtain software features by building a new pre-trained BERT model, constructing a C/C++ vocabulary and pre-training on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder to create low-dimensional embeddings from the software features. Finally, we apply a clustering-based outlier detection method to the embeddings to detect vulnerabilities. We evaluate our method on five datasets of programs written in C/C++; experimental results show that it outperforms state-of-the-art software vulnerability detection methods.
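The abstract describes a three-stage pipeline: BERT features extracted from source code, a deep autoencoder that compresses them into low-dimensional embeddings, and clustering-based outlier detection over those embeddings. Below is a minimal Python sketch of that pipeline, not the authors' implementation: it uses the generic `bert-base-uncased` checkpoint as a stand-in for the paper's C/C++-pre-trained model, keeps BERT frozen rather than fine-tuning it jointly with the autoencoder, and the layer sizes, cluster count, and contamination rate are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer


def encode_snippets(snippets, model_name="bert-base-uncased"):
    """Extract one [CLS] feature vector per code snippet.

    The paper pre-trains its own BERT with a C/C++ vocabulary; the
    generic checkpoint used here is only a placeholder.
    """
    tok = AutoTokenizer.from_pretrained(model_name)
    bert = AutoModel.from_pretrained(model_name).eval()
    with torch.no_grad():
        batch = tok(snippets, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
        out = bert(**batch)
    return out.last_hidden_state[:, 0, :]  # [CLS] vectors, shape (N, 768)


class DeepAutoencoder(nn.Module):
    """Compress BERT features into a low-dimensional embedding."""

    def __init__(self, in_dim=768, z_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, z_dim))
        self.decoder = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                     nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


def flag_outliers(features, z_dim=32, n_clusters=8, epochs=100,
                  contamination=0.05):
    """Train the autoencoder, cluster the embeddings with k-means, and
    flag the samples farthest from their nearest cluster centre."""
    ae = DeepAutoencoder(features.shape[1], z_dim)
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
    for _ in range(epochs):  # plain reconstruction training
        recon, _ = ae(features)
        loss = nn.functional.mse_loss(recon, features)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, z = ae(features)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(z.numpy())
    dist = km.transform(z.numpy()).min(axis=1)  # distance to nearest centre
    cutoff = np.quantile(dist, 1.0 - contamination)
    return dist > cutoff  # True => suspected vulnerable sample
```

With real data, `flag_outliers(encode_snippets(list_of_functions))` would return a boolean mask over the functions. Note that k-means needs at least `n_clusters` samples, and the distance-to-centroid score here is one simple choice of clustering-based outlier detector; the paper's actual detector may differ.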
Pages: 904-911
Page count: 8