MaGnn: Binary-Source Code Matching by Modality-Sharing Graph Convolution for Binary Provenance Analysis

Cited by: 0

Authors:

Ou, Weihan [1]

Ding, Steven H. H. [1]

Affiliations:

[1] Queens Univ, Sch Comp, Kingston, ON, Canada

Source:

2023 IEEE 47TH ANNUAL COMPUTERS, SOFTWARE, AND APPLICATIONS CONFERENCE, COMPSAC | 2023

Keywords:

binary provenance; representation learning; binary source code matching; graph learning;

DOI:

10.1109/COMPSAC57700.2023.00091

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification codes:

081203 ; 0835 ;

Abstract:

The number and variety of binaries running on electrical devices, public clouds, and on-premise infrastructure have been increasing rapidly. Recent successful supply chain attacks indicate that even binaries known to be developed by trusted developers can still contain malicious functionality and copy-and-pasted vulnerabilities that pose security risks to operational systems and end users. Code provenance analysis helps to mitigate this problem by revealing information about the origin of a binary sample, such as the author or the included software bill of materials. Since in most cases source symbol information is removed during compilation, matching a given binary code sample to its corresponding source code could improve the accuracy and efficiency of provenance analysis. Existing binary-source code matching methods focus on comparing manually selected code literals (e.g., the number of if/else statements). However, these methods generalize poorly and require significant manual effort. In contrast, we propose a machine learning-based binary-source code matching system, MaGnn, which measures the consistency of an input binary-source code pair by automatically extracting high-dimensional feature representations of the input and calculating their functional similarity. With a Siamese architecture that shares a unified encoder across the two modalities, MaGnn is able to calculate the similarity of the input binary-source code pair from the automatically extracted functionality representations. With a graph convolutional neural network as the representation encoder, MaGnn is able to learn and encode the functionality information of the input pairs from their graph features into high-dimensional representation vectors.
We benchmark MaGnn against a state-of-the-art binary-source code matching method and two machine learning models on six out-of-sample datasets collected from five real-world libraries. Our experimental results show that MaGnn outperforms the baselines on most out-of-sample datasets.
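
The Siamese idea described in the abstract (one encoder shared by both modalities, followed by a similarity score) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the single mean-aggregation layer, the mean-pooling readout, the cosine score, the toy graphs, and all names (`gcn_layer`, `encode`, `match_score`, `SHARED_WEIGHT`) are assumptions chosen only for illustration, and the weights are fixed by hand rather than learned.

```python
import math


def gcn_layer(adj, feats, weight):
    """Simplified graph convolution: average each node's features with its
    neighbours (self-loop included), apply a linear map, then ReLU."""
    n = len(adj)
    out = []
    for i in range(n):
        neigh = [i] + [j for j in range(n) if adj[i][j] and j != i]
        agg = [sum(feats[j][k] for j in neigh) / len(neigh)
               for k in range(len(feats[0]))]
        row = [max(0.0, sum(agg[k] * weight[k][d] for k in range(len(agg))))
               for d in range(len(weight[0]))]
        out.append(row)
    return out


def encode(adj, feats, weight):
    """Shared encoder (assumed form): one layer, then mean pooling over nodes."""
    hidden = gcn_layer(adj, feats, weight)
    return [sum(h[d] for h in hidden) / len(hidden)
            for d in range(len(hidden[0]))]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def match_score(binary_graph, source_graph, weight):
    """Siamese scoring: both graphs go through the same encoder weights."""
    return cosine(encode(*binary_graph, weight),
                  encode(*source_graph, weight))


# Hypothetical toy inputs: (adjacency matrix, node feature matrix).
SHARED_WEIGHT = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
binary_graph = ([[0, 1, 0], [1, 0, 1], [0, 1, 0]],
                [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
source_graph = ([[0, 1], [1, 0]],
                [[1, 1, 0], [0, 1, 1]])

score = match_score(binary_graph, source_graph, SHARED_WEIGHT)
print(f"similarity: {score:.3f}")
```

Because the same `SHARED_WEIGHT` matrix encodes both inputs, the two embeddings live in one vector space and can be compared directly. In a trained system the weights would be learned from labelled binary-source pairs rather than set by hand.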

Pages: 658-666

Page count: 9
