Cross-Language Code Search using Static and Dynamic Analyses

Cited by: 15
Authors
Mathew, George [1]
Stolee, Kathryn T. [1]
Affiliations
[1] North Carolina State Univ, Raleigh, NC 27695 USA
Funding
National Science Foundation (USA)
Keywords
code-to-code search; cross-language code search; non-dominated sorting; static analysis; dynamic analysis; CLONE; ALGORITHM; TREES;
DOI
10.1145/3468264.3468538
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202 ; 0835 ;
Abstract
As code search permeates most activities in software development, code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for refactoring, patch identification for program repair, and language translation. Existing code-to-code search tools rely on static similarity approaches such as the comparison of tokens and abstract syntax trees (AST) to approximate dynamic behavior, leading to low precision. Most tools do not support cross-language code-to-code search, and those that do, rely on machine learning models that require labeled training data. We present Code-to-Code Search Across Languages (COSAL), a cross-language technique that uses both static and dynamic analyses to identify similar code and does not require a machine learning model. Code snippets are ranked using non-dominated sorting based on code token similarity, structural similarity, and behavioral similarity. We empirically evaluate COSAL on two datasets of 43,146 Java and Python files and 55,499 Java files and find that 1) code search based on non-dominated ranking of static and dynamic similarity measures is more effective compared to single or weighted measures; and 2) COSAL has better precision and recall compared to state-of-the-art within-language and cross-language code-to-code search tools. We explore the potential for using COSAL on large open-source repositories and discuss scalability to more languages and similarity metrics, providing a gateway for practical, multi-language code-to-code search.
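The abstract states that snippets are ranked by non-dominated sorting over token, structural, and behavioral similarity. The sketch below shows the textbook form of that ranking idea only; it is not taken from the COSAL implementation. The file names, the score values, and the helper names `dominates` and `non_dominated_sort` are all hypothetical, and each score is assumed to be a similarity in [0, 1] where higher is better.

```python
from typing import Dict, List, Tuple

# Hypothetical score triple: (token, structural, behavioral) similarity
# to the query, each assumed to lie in [0, 1] with higher meaning closer.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """Return True if a is no worse than b on every measure and better on one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated_sort(candidates: Dict[str, Scores]) -> List[List[str]]:
    """Group candidates into fronts; earlier fronts rank higher."""
    remaining = dict(candidates)
    fronts: List[List[str]] = []
    while remaining:
        # A candidate belongs to the current front if no other remaining
        # candidate dominates it.
        front = sorted(
            name
            for name, scores in remaining.items()
            if not any(dominates(other, scores) for other in remaining.values())
        )
        fronts.append(front)
        for name in front:
            del remaining[name]
    return fronts


# Illustrative values only; these are not results from the paper.
candidates = {
    "A.java": (0.9, 0.8, 0.9),
    "b.py": (0.7, 0.9, 0.6),
    "C.java": (0.6, 0.5, 0.5),
    "d.py": (0.2, 0.1, 0.3),
}
fronts = non_dominated_sort(candidates)
for rank, front in enumerate(fronts, start=1):
    print(rank, front)
```

In this toy input, `A.java` and `b.py` share the first front because each beats the other on at least one measure, so neither is forced below the other by a fixed weighting. That is the property the abstract contrasts with single or weighted measures.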
Pages: 205-217
Page count: 13