CRABS-former: CRoss-Architecture Binary Code Similarity Detection based on Transformer

Cited by: 0
Authors
Feng, Yuhong [1]
Li, Haoran [1]
Cao, Yixuan [1]
Wang, Yufeng [1]
Feng, Haiyue [1]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen, Guangdong, Peoples R China
Keywords
Binary Analysis; Similarity Detection; Cross-Architecture; Transformer;
DOI
10.1145/3671016.3671390
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Binary code similarity detection (BCSD) is widely used in software analysis such as vulnerability detection and malware identification. Among various forms of binary representation, assembly is particularly feasible for real-world applications due to its efficient preprocessing compared to graph and intermediate representation (IR). Existing assembly-based methods leverage the text embedding capabilities of pretrained language models such as BERT, which still encounter limitations in cross-architecture BCSD due to the characteristics of assembly code and the lack of cross-architecture vocabulary. In this paper, we first design several normalization strategies to preprocess assembly code from multiple instruction set architectures (ISAs), in order to decrease the token length of assembly code inputs and reduce the size of vocabulary, thereby improving processing efficiency and simplifying model structure. Then, we propose a method to collect token instances and construct a tokenizer capable of processing assembly code from multiple ISAs, enhancing the model's ability to interpret such code. Based on this tokenizer, we develop a CRoss-Architecture Binary code Similarity detection model based on Transformer (CRABS-former). CRABS-former compares two binary functions from different ISAs, compilers or optimization options and computes their similarity score. Finally, we conduct experiments for two BCSD tasks (one-to-one and one-to-many) using CRABS-former, comparing its performance against four baselines: SAFE, Trex, jTrans, and TE3L. The results indicate that CRABS-former, with a pool size of 10,000, improves recall by 10.85%, 18.02%, and 3.33% across different ISAs, compilers, and optimizations, respectively, underscoring the effectiveness of our approach.
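The one-to-many task mentioned in the abstract ranks a query function against a pool of candidates and reports recall. The following is a minimal sketch of how such a recall figure could be computed, assuming each function is already represented by an embedding vector and that similarity is measured with cosine similarity. The names `cosine` and `recall_at_k`, the toy vectors, and the choice of cosine similarity are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_at_k(queries, pool, k=1):
    """Fraction of queries whose true match is among the top-k pool entries.

    queries: list of (embedding, index_of_true_match_in_pool)
    pool:    list of candidate embeddings
    """
    hits = 0
    for q_vec, true_idx in queries:
        # Rank pool indices by descending similarity to the query.
        ranked = sorted(range(len(pool)),
                        key=lambda i: cosine(q_vec, pool[i]),
                        reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy data (hypothetical): a pool of 3 candidates and 3 queries.
pool = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [([0.9, 0.1], 0), ([0.1, 0.9], 1), ([1.0, 0.2], 2)]

print(recall_at_k(queries, pool, k=1))
print(recall_at_k(queries, pool, k=2))
```

In this toy setup the third query is closest to the wrong candidate, so it only counts as a hit once k is raised to 2. A larger pool, such as the 10,000 candidates cited in the abstract, makes the task harder because more distractors compete with the true match.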
Pages: 11-20
Number of pages: 10
Related papers
50 in total
  • [1] CBSDI: Cross-Architecture Binary Code Similarity Detection based on Index Table
    Deng, Longmin
    Zhao, Dongdong
    Zhou, Junwei
    Xia, Zhe
    Xiang, Jianwen
    2022 IEEE 22ND INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY, QRS, 2022, : 527 - 536
  • [2] Optir-SBERT: Cross-Architecture Binary Code Similarity Detection Based on Optimized LLVM IR
    Yan, Yintong
    Yu, Lu
    Wang, Taiyan
    Li, Yuwei
    Pan, Zulie
    DIGITAL FORENSICS AND CYBER CRIME, PT 2, ICDF2C 2023, 2024, 571 : 95 - 113
  • [3] Multi-Level Cross-Architecture Binary Code Similarity Metric
    Meng Qiao
    Xiaochuan Zhang
    Huihui Sun
    Zheng Shan
    Fudong Liu
    Wenjie Sun
    Xingwei Li
    Arabian Journal for Science and Engineering, 2021, 46 : 8603 - 8615
  • [4] Multi-Level Cross-Architecture Binary Code Similarity Metric
    Qiao, Meng
    Zhang, Xiaochuan
    Sun, Huihui
    Shan, Zheng
    Liu, Fudong
    Sun, Wenjie
    Li, Xingwei
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2021, 46 (09) : 8603 - 8615
  • [5] Cross-architecture Binary Function Similarity Detection based on Composite Feature Model
    Li, Xiaonan
    Zhang, Guimin
    Li, Qingbao
    Zhang, Ping
    Chen, Zhifeng
    Liu, Jinjin
    Yue, Shudan
    KSII TRANSACTIONS ON INTERNET AND INFORMATION SYSTEMS, 2023, 17 (08) : 2101 - 2123
  • [6] discovRE: Efficient Cross-Architecture Identification of Bugs in Binary Code
    Eschweiler, Sebastian
    Yakdan, Khaled
    Gerhards-Padilla, Elmar
    23RD ANNUAL NETWORK AND DISTRIBUTED SYSTEM SECURITY SYMPOSIUM (NDSS 2016), 2016,
  • [7] HAformer: Semantic fusion of hex machine code and assembly code for cross-architecture binary vulnerability detection
    Jiang, Xunzhi
    Wang, Shen
    Gong, Yuxin
    Yu, Tingyue
    Liu, Li
    Yu, Xiangzhan
    COMPUTERS & SECURITY, 2024, 145
  • [8] DVul-WLG: Graph Embedding Network Based on Code Similarity for Cross-Architecture Firmware Vulnerability Detection
    Sun, Hao
    Tong, Yanjun
    Zhao, Jing
    Gu, Zhaoquan
    INFORMATION SECURITY (ISC 2021), 2021, 13118 : 320 - 337
  • [9] Cross-Architecture Binary Semantics Understanding via Similar Code Comparison
    Hu, Yikun
    Zhang, Yuanyuan
    Li, Juanru
    Gu, Dawu
    2016 IEEE 23RD INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION, AND REENGINEERING (SANER), VOL 1, 2016, : 57 - 67
  • [10] Inter-BIN: Interaction-Based Cross-Architecture IoT Binary Similarity Comparison
    Song, Qige
    Zhang, Yongzheng
    Wang, Binglai
    Chen, Yige
    IEEE INTERNET OF THINGS JOURNAL, 2022, 9 (20) : 20018 - 20033