Software Vulnerabilities Detection Based on a Pre-trained Language Model

Cited: 0
Authors
Xu, Wenlin [1]
Li, Tong [2]
Wang, Jinsong [3]
Duan, Haibo [3]
Tang, Yahui [4]
Affiliations
[1] Yunnan Univ, Sch Informat Sci & Engn, Kunming, Yunnan, Peoples R China
[2] Yunnan Agr Univ, Sch Big Data, Kunming, Yunnan, Peoples R China
[3] Yunnan Univ Finance & Econ, Informat Management Ctr, Kunming, Yunnan, Peoples R China
[4] Chongqing Univ Posts & Telecommun, Sch Software, Chongqing, Peoples R China
Source
2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023 | 2024
Keywords
Cyber security; Vulnerability detection; Pre-trained language model; Autoencoder; Outlier detection;
DOI
10.1109/TrustCom60117.2023.00129
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Software vulnerability detection is crucial in cyber security, protecting software systems from malicious attacks. Most earlier techniques relied on security professionals to define software features before training a classification or regression model on those features to find vulnerabilities. However, defining software features and collecting high-quality labeled vulnerabilities for training are both time-consuming. To address these issues, we propose an unsupervised and effective method for extracting software features and detecting software vulnerabilities automatically. First, we obtain software features by building a new pre-trained BERT model: we construct a C/C++ vocabulary and pre-train on software source code. We then fine-tune the pre-trained BERT model with a deep autoencoder to create a low-dimensional embedding of the software features. Finally, we apply a clustering-based outlier detection method to the embedding to detect vulnerabilities. We evaluate our method on five datasets of programs written in C/C++; experimental results show that it outperforms state-of-the-art software vulnerability detection methods.
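For orientation, below is a minimal, hypothetical sketch of the three-stage pipeline the abstract describes (BERT feature extraction, deep-autoencoder compression, clustering-based outlier detection), assuming Hugging Face transformers, PyTorch, and scikit-learn. The checkpoint name, network sizes, cluster count, and toy snippets are illustrative assumptions, not the paper's configuration; in particular, the autoencoder here is trained on frozen BERT features for simplicity, whereas the paper fine-tunes BERT jointly with the autoencoder.

import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans
from transformers import BertModel, BertTokenizerFast

# 1) Feature extraction: encode each C/C++ function with BERT.
#    The paper pre-trains BERT on source code with a custom C/C++
#    vocabulary; a generic checkpoint stands in here (assumption).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

def encode(snippets):
    # Return one 768-d [CLS] vector per code snippet.
    batch = tokenizer(snippets, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state[:, 0, :]

# 2) Deep autoencoder: compress the 768-d features to a
#    low-dimensional embedding (latent size is an assumption).
class AutoEncoder(nn.Module):
    def __init__(self, in_dim=768, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def train_autoencoder(feats, epochs=50, lr=1e-3):
    # Reconstruction training on frozen BERT features; the paper
    # instead fine-tunes BERT together with the autoencoder.
    ae = AutoEncoder(feats.shape[1])
    opt = torch.optim.Adam(ae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, _ = ae(feats)
        loss = nn.functional.mse_loss(recon, feats)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return ae

# 3) Clustering-based outlier detection: score each sample by its
#    distance to the nearest KMeans centroid in the latent space;
#    the highest-scoring samples are flagged as likely vulnerable.
def outlier_scores(latent, n_clusters=8):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(latent)
    return np.linalg.norm(latent - km.cluster_centers_[km.labels_], axis=1)

# Toy usage: two snippets, one with a classic unsafe gets() call.
snippets = ["int main() { char buf[8]; gets(buf); return 0; }",
            "int add(int a, int b) { return a + b; }"]
feats = encode(snippets)
ae = train_autoencoder(feats)
with torch.no_grad():
    _, latent = ae(feats)
scores = outlier_scores(latent.numpy(), n_clusters=2)
print(scores)  # higher score = more anomalous = more suspicious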
Pages: 904-911
Page count: 8