MaGnn: Binary-Source Code Matching by Modality-Sharing Graph Convolution for Binary Provenance Analysis

Authors
Ou, Weihan [1]
Ding, Steven H. H. [1]
Affiliations
[1] Queens Univ, Sch Comp, Kingston, ON, Canada
Keywords
binary provenance; representation learning; binary source code matching; graph learning
DOI
10.1109/COMPSAC57700.2023.00091
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The number and variety of binaries running on electrical devices, public clouds, and on-premise infrastructure have been increasing rapidly. Recent successful supply chain attacks indicate that even binaries known to be developed by trusted developers can still contain malicious functionality and copy-and-pasted vulnerabilities that pose security risks to operational systems and end users. Code provenance analysis helps to mitigate this problem by revealing information about the origin of a binary sample, such as the author or the included software bill of materials. Since in most cases source symbol information is removed during compilation, matching a given binary code sample to its corresponding source code could improve the accuracy and efficiency of provenance analysis. Existing binary-source code matching methods focus on comparing manually selected code literals (e.g., the number of if/else statements). However, these methods generalize poorly and require significant manual effort. In contrast to these methods, we propose a machine learning-based binary-source code matching system, MaGnn, which measures the consistency of an input binary-source code pair by automatically extracting high-dimensional feature representations of the input and calculating their functionality similarity. With a Siamese architecture that shares a unified encoder across the two modalities, MaGnn is able to calculate the similarity of the input binary-source code pair from the automatically extracted functionality representations. With a graph convolutional neural network as the representation encoder, MaGnn is able to learn and encode the functionality information of the input pairs from their graph features into high-dimensional representation vectors.
We benchmark MaGnn against a state-of-the-art binary-source code matching method and two machine learning models on six out-of-sample datasets collected from five real-world libraries. Our experimental results show that MaGnn outperforms the baselines on most out-of-sample datasets.
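The abstract describes a Siamese design in which one graph convolutional encoder is shared by both modalities and the pair is scored by the similarity of the two resulting vectors. The following is only a minimal sketch of that general idea in plain Python, not the authors' implementation. The function names, the standard symmetric-normalized GCN propagation rule, mean pooling, cosine similarity, random untrained weights, and the toy graphs are all assumptions made for illustration; the record does not specify the actual layers, features, similarity function, or loss.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Add self-loops and return D^-1/2 (A + I) D^-1/2 (common GCN normalization)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def gcn_layer(adj_norm, feats, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    h = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in h]


def encode(adj, feats, weights):
    """Shared encoder: stack GCN layers, then mean-pool nodes into one vector."""
    adj_norm = normalize_adjacency(adj)
    h = feats
    for w in weights:
        h = gcn_layer(adj_norm, h, w)
    return [sum(col) / len(h) for col in zip(*h)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def make_weights(dims, seed=0):
    """Hypothetical helper: random (untrained) weight matrices for the given layer sizes."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(dims[k + 1])]
             for _ in range(dims[k])] for k in range(len(dims) - 1)]


def match_score(binary_graph, source_graph, weights):
    """Siamese scoring: the SAME weights encode both graphs."""
    return cosine(encode(*binary_graph, weights), encode(*source_graph, weights))


# Toy inputs (assumed): each graph is (adjacency matrix, node feature matrix),
# with both modalities using node features of the same dimension (4).
binary_graph = ([[0, 1, 0], [1, 0, 1], [0, 1, 0]],
                [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]])
source_graph = ([[0, 1], [1, 0]],
                [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]])

weights = make_weights([4, 8, 8])
print(match_score(binary_graph, source_graph, weights))
```

Because the weights here are random, the printed score carries no meaning; in a trained system the shared weights would be learned so that matching binary-source pairs score higher than non-matching ones. Sharing one set of weights is what lets vectors from the two modalities be compared directly in the same space.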
Pages: 658-666
Number of pages: 9