Software Vulnerabilities Detection Based on a Pre-trained Language Model

被引:0
|
作者
Xu, Wenlin [1 ]
Li, Tong [2 ]
Wang, Jinsong [3 ]
Duan, Haibo [3 ]
Tang, Yahui [4 ]
机构
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
关键词
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection;
D O I
10.1109/TrustCom60117.2023.00129
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Software vulnerabilities detection is crucial in cyber security which protects the software systems from malicious attacks. The majority of earlier techniques relied on security professionals to provide software features before training a classification or regression model on the features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time consuming. To handle these issues, in this paper, we propose an unsupervised and effective method for extracting software features and detecting software vulnerabilities automatically. Firstly, we obtain software features and build a new pre-trained BERT model through constructing C/C++ vocabulary and pre-training on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder and create low-dimensional embedding from the software features. We finally apply a clustering-based outlier detection method on the embedding to detect vulnerabilities. We evaluate our method on five datasets with programs written in C/C++, experimental results show that our method outperforms state-of-the-art software vulnerability detection methods.
引用
收藏
页码:904 / 911
页数:8
相关论文
共 50 条
  • [1] Detection of Chinese Deceptive Reviews Based on Pre-Trained Language Model
    Weng, Chia-Hsien
    Lin, Kuan-Cheng
    Ying, Jia-Ching
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (07):
  • [2] Data Augmentation Based on Pre-trained Language Model for Event Detection
    Zhang, Meng
    Xie, Zhiwen
    Liu, Jin
    [J]. CCKS 2021 - EVALUATION TRACK, 2022, 1553 : 59 - 68
  • [3] Hyperbolic Pre-Trained Language Model
    Chen, Weize
    Han, Xu
    Lin, Yankai
    He, Kaichen
    Xie, Ruobing
    Zhou, Jie
    Liu, Zhiyuan
    Sun, Maosong
    [J]. IEEE-ACM TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2024, 32 : 3101 - 3112
  • [4] Pre-trained Model Based Feature Envy Detection
    Ma, Wenhao
    Yu, Yaoxiang
    Ruan, Xiaoming
    Cai, Bo
    [J]. 2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2023, : 430 - 440
  • [5] Repairing Security Vulnerabilities Using Pre-trained Programming Language Models
    Huang, Kai
    Yang, Su
    Sun, Hongyu
    Sun, Chengyi
    Li, Xuejun
    Zhang, Yuqing
    [J]. 52ND ANNUAL IEEE/IFIP INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND NETWORKS WORKSHOP VOLUME (DSN-W 2022), 2022, : 111 - 116
  • [6] Pre-trained Language Model Representations for Language Generation
    Edunov, Sergey
    Baevski, Alexei
    Auli, Michael
    [J]. 2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 4052 - 4059
  • [7] Pre-trained Language Model based Ranking in Baidu Search
    Zou, Lixin
    Zhang, Shengqiang
    Cai, Hengyi
    Ma, Dehong
    Cheng, Suqi
    Wang, Shuaiqiang
    Shi, Daiting
    Cheng, Zhicong
    Yin, Dawei
    [J]. KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 4014 - 4022
  • [9] Interpretability of Entity Matching Based on Pre-trained Language Model
    Liang, Zheng
    Wang, Hong-Zhi
    Dai, Jia-Jia
    Shao, Xin-Yue
    Ding, Xiao-Ou
    Mu, Tian-Yu
    [J]. Ruan Jian Xue Bao/Journal of Software, 2023, 34 (03): : 1087 - 1108
  • [10] Comparing Pre-Trained Language Model for Arabic Hate Speech Detection
    Daouadi, Kheir Eddine
    Boualleg, Yaakoub
    Guehairia, Oussama
    [J]. COMPUTACION Y SISTEMAS, 2024, 28 (02): : 681 - 693