Flowchart-Based Cross-Language Source Code Similarity Detection

Cited by: 3
Authors
Zhang, Feng [1, 2]
Li, Guofan [1]
Liu, Cong [3]
Song, Qian [1]
Affiliations
[1] Shandong Univ Sci & Technol, Coll Comp Sci & Engn, Qingdao 266590, Peoples R China
[2] Shandong Key Lab Wisdom Mine Informat Technol, Qingdao 266590, Peoples R China
[3] Shandong Univ Technol, Sch Comp Sci & Technol, Zibo 255000, Peoples R China
Funding
U.S. National Science Foundation;
Keywords
SOFTWARE PLAGIARISM DETECTION; SYSTEM; REUSE;
DOI
10.1155/2020/8835310
Chinese Library Classification number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
Source code similarity detection has applications in code plagiarism detection and software intellectual property protection. In computer programming teaching, students may convert source code written in one programming language into another language before submitting it as their own assignment. Existing similarity measures for source code written in the same language are not applicable to cross-language code similarity detection because of the syntactic differences among programming languages. Meanwhile, existing cross-language source code similarity detection approaches are susceptible to complex code obfuscation techniques, such as replacing a control structure with an equivalent one and adding redundant statements. To solve this problem, we propose a cross-language code similarity detection (CLCSD) approach based on code flowcharts. In general, two source code fragments written in different programming languages are transformed into standardized code flowcharts (SCFCs), and their similarity is obtained by comparing the corresponding SCFCs. More specifically, we first introduce the SCFC model as a uniform flowchart representation of source code written in different languages. An SCFC is language-independent and can therefore serve as the intermediate structure for source code similarity detection. Transformation techniques are given to convert source code written in a specific programming language into an SCFC. Second, we propose the SCFC-SPGK algorithm, based on the shortest path graph kernel, to measure the similarity between two SCFCs. Thus, the similarity between two pieces of source code in different programming languages is given by the similarity between their SCFCs. Experimental results show that, compared with existing approaches, CLCSD has higher accuracy in cross-language source code similarity detection. Furthermore, CLCSD can not only handle the common source code obfuscation techniques used by students in computer programming teaching but also obtain nearly 90% accuracy in dealing with some complex obfuscation techniques.
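The idea of comparing two flowcharts with a shortest path graph kernel can be illustrated with a minimal sketch. This is not the authors' SCFC-SPGK algorithm, whose details are not given in the abstract. It is a generic shortest path kernel that assumes each flowchart is a directed graph with one type label per node. The node labels ("start", "process", "decision", "end"), the graph encoding, and all function names are hypothetical choices made for this illustration.

```python
import math
from collections import Counter


def shortest_path_lengths(nodes, edges):
    """Floyd-Warshall on a directed graph where every edge has length 1."""
    dist = {(u, v): (0 if u == v else math.inf) for u in nodes for v in nodes}
    for u, v in edges:
        if u != v:
            dist[(u, v)] = 1
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via_k = dist[(i, k)] + dist[(k, j)]
                if via_k < dist[(i, j)]:
                    dist[(i, j)] = via_k
    return dist


def path_features(labels, edges):
    """Count (source label, target label, shortest path length) triples."""
    dist = shortest_path_lengths(list(labels), edges)
    features = Counter()
    for (u, v), d in dist.items():
        if u != v and d < math.inf:
            features[(labels[u], labels[v], d)] += 1
    return features


def kernel(f1, f2):
    """Dot product of two feature count vectors."""
    return sum(count * f2[key] for key, count in f1.items())


def similarity(g1, g2):
    """Cosine-normalized shortest path kernel value in [0, 1]."""
    f1, f2 = path_features(*g1), path_features(*g2)
    denom = math.sqrt(kernel(f1, f1) * kernel(f2, f2))
    return kernel(f1, f2) / denom if denom else 0.0


# Hypothetical flowchart of a counting loop. A `for` loop in one language
# and an equivalent `while` loop in another would both map to this shape.
loop_labels = {"s": "start", "i": "process", "c": "decision",
               "b": "process", "e": "end"}
loop_edges = [("s", "i"), ("i", "c"), ("c", "b"), ("b", "c"), ("c", "e")]
loop_a = (loop_labels, loop_edges)
loop_b = (dict(loop_labels), list(loop_edges))

# Hypothetical flowchart of straight-line code with no branching.
straight = ({"s": "start", "p": "process", "e": "end"},
            [("s", "p"), ("p", "e")])

print(similarity(loop_a, loop_b))    # identical flowcharts
print(similarity(loop_a, straight))  # structurally different flowcharts
```

Because the comparison works on node types and path lengths rather than on language syntax, two programs that reduce to the same flowchart score as identical in this sketch, whatever language they were written in.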
Pages: 15