Software Vulnerabilities Detection Based on a Pre-trained Language Model

Cited: 0
Authors
Xu, Wenlin [1]
Li, Tong [2]
Wang, Jinsong [3]
Duan, Haibo [3]
Tang, Yahui [4]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
Source
2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023 | 2024
Keywords
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection
DOI
10.1109/TrustCom60117.2023.00129
CLC Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Software vulnerability detection is crucial in cyber security, as it protects software systems from malicious attacks. The majority of earlier techniques relied on security professionals to define software features before training a classification or regression model on those features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time-consuming. To address these issues, in this paper we propose an unsupervised and effective method for extracting software features and detecting software vulnerabilities automatically. Firstly, we obtain software features and build a new pre-trained BERT model by constructing a C/C++ vocabulary and pre-training on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder to create low-dimensional embeddings from the software features. Finally, we apply a clustering-based outlier detection method to the embeddings to detect vulnerabilities. We evaluate our method on five datasets of programs written in C/C++; experimental results show that it outperforms state-of-the-art software vulnerability detection methods.
Pages: 904-911
Page count: 8
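
The abstract describes a three-stage pipeline: pre-training BERT on C/C++ source code with a custom vocabulary, fine-tuning with a deep autoencoder to obtain low-dimensional embeddings, and applying clustering-based outlier detection to those embeddings. The Python sketch below only illustrates how such a pipeline could be wired together; the model path, layer sizes, cluster count, and outlier threshold are assumptions for illustration and are not taken from the paper.

# Illustrative sketch only: paths and hyperparameters are assumed, not the authors' code.
import numpy as np
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast
from sklearn.cluster import KMeans

# 1) Embed C/C++ functions with a BERT encoder pre-trained on source code
#    using a custom C/C++ vocabulary (path is a placeholder).
tokenizer = BertTokenizerFast.from_pretrained("path/to/cpp-bert")
encoder = BertModel.from_pretrained("path/to/cpp-bert")

def embed(functions):
    # Return one pooled feature vector per source-code function.
    batch = tokenizer(functions, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state.mean(dim=1)  # mean-pool token states

# 2) Deep autoencoder that compresses the BERT features into a
#    low-dimensional embedding; trained with a reconstruction loss
#    (training loop omitted for brevity).
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=768, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

# 3) Clustering-based outlier detection: functions whose latent embedding
#    lies far from every cluster centre are flagged as suspicious.
def detect_outliers(latent, n_clusters=8, quantile=0.95):
    latent = np.asarray(latent)                 # accept torch tensors or arrays
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(latent)
    dist = km.transform(latent).min(axis=1)     # distance to nearest centre
    return dist > np.quantile(dist, quantile)   # True = suspected vulnerability

# Usage sketch: features = embed(list_of_functions)
#               _, z = AutoEncoder()(features)    # after training the autoencoder
#               flags = detect_outliers(z.detach())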