Leveraging a Corpus of Natural Language Descriptions for Program Similarity

Cited by: 11
Authors
Zilberstein, Meital [1]
Yahav, Eran [1]
Affiliations
[1] Technion, Haifa, Israel
Keywords
Code Similarity; Natural Language; Program Analysis; Semantics
DOI
10.1145/2986012.2986013
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification codes
081202; 0835
Abstract
Program similarity is a central challenge in many programming-related applications, such as code search, clone detection, automatic translation, and programming education. We present a novel approach for establishing the similarity of code fragments by: (i) obtaining textual descriptions of code fragments captured in millions of posts on question-answering sites, blogs, and other sources, and (ii) using natural language processing techniques to establish similarity between textual descriptions, and thus between their corresponding code fragments. To improve precision, we use a simple static analysis that extracts type signatures, and combine the results of textual similarity with the similarity of the signatures. Because our notion of code similarity is based on the similarity of textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches. To evaluate our approach, we use data obtained from the popular question-answering site STACK-OVERFLOW. To obtain a ground truth to compare against, we developed a crowdsourcing system, LIKE2DROPS, that allows users to label the similarity of code fragments. We used the system to collect similarity classifications for a massive corpus of 6,500 program pairs. Our results show that our technique is effective in determining similarity, and achieves more than 85% precision, recall, and accuracy.
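The combination described in the abstract (similarity of textual descriptions blended with similarity of type signatures) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the text-similarity measure, the signature comparison, or how the two are combined. The bag-of-words cosine similarity, the Jaccard overlap of type names, the weight `text_weight`, and all function names below are assumptions chosen for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a description and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(x, y):
    """Jaccard overlap between two collections of type names."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y) if (x or y) else 0.0


def combined_similarity(desc_a, sig_a, desc_b, sig_b, text_weight=0.7):
    """Blend description similarity with type-signature similarity.

    text_weight is a hypothetical parameter; the source does not say
    how the two scores are weighted.
    """
    text_sim = cosine(Counter(tokenize(desc_a)), Counter(tokenize(desc_b)))
    sig_sim = jaccard(sig_a, sig_b)
    return text_weight * text_sim + (1 - text_weight) * sig_sim


# Hypothetical fragments: (description, type names from a signature).
java_read = ("read a text file line by line", ["String", "BufferedReader"])
py_read = ("read file line by line in python", ["str", "file"])
java_sort = ("sort a list of integers", ["List", "int"])

related = combined_similarity(*java_read, *py_read)
unrelated = combined_similarity(*java_read, *java_sort)
print(related > unrelated)
```

In this toy setup the two file-reading fragments score higher than the reading/sorting pair even though their type names do not overlap, which mirrors the abstract's point that description-based similarity can relate code across languages.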
Pages: 197-211
Page count: 15
Related papers
50 in total
  • [1] Zoom: a corpus of natural language descriptions of map locations
    Altamirano, Romina
    Ferreira, Thiago C.
    Paraboni, Ivandre
    Benotti, Luciana
    [J]. PROCEEDINGS OF THE 53RD ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL) AND THE 7TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (IJCNLP), VOL 2, 2015, : 69 - 75
  • [2] Predicting odor mixture similarity leveraging natural language percepts
    Meyer, Pablo
    Dhurandhar, Amit
    Cecchi, Guillermo
    [J]. CHEMICAL SENSES, 2022, 47
  • [3] APPLICATION OF PROGRAM DESIGN LANGUAGE TOOLS TO ABBOTTS METHOD OF PROGRAM DESIGN BY INFORMAL NATURAL-LANGUAGE DESCRIPTIONS
    BERRY, DM
    YAVNE, N
    YAVNE, M
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 1987, 7 (03) : 221 - 247
  • [4] A Corpus of Natural Multimodal Spatial Scene Descriptions
    Han, Ting
    Schlangen, David
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 2113 - 2118
  • [5] Leveraging Sentence Similarity in Natural Language Generation: Improving Beam Search using Range Voting
    Borgeaud, Sebastian
    Emerson, Guy
    [J]. NEURAL GENERATION AND TRANSLATION, 2020, : 97 - 109
  • [6] Converting the Corpus Query Language to the Natural Language
    Rysava, Daniela
    Volkova, Nikol
    Rambousek, Adam
    [J]. RECENT ADVANCES IN SLAVONIC NATURAL LANGUAGE PROCESSING (RASLAN 2015), 2015, : 43 - 48
  • [7] Visualizing Natural Language Descriptions: A Survey
    Hassani, Kaveh
    Lee, Won-Sook
    [J]. ACM COMPUTING SURVEYS, 2016, 49 (01)
  • [8] Generating Customizable Natural Language Descriptions
    Costa, A.
    Paraboni, I
    [J]. IEEE LATIN AMERICA TRANSACTIONS, 2019, 17 (08) : 1252 - 1258
  • [9] NATURAL-LANGUAGE DESCRIPTIONS OF PROCEDURES
    OLSON, GM
    TRAHAN, M
    ROSHWALB, L
    EATON, M
    [J]. BULLETIN OF THE PSYCHONOMIC SOCIETY, 1983, 21 (05) : 353 - 353
  • [10] Novel Natural Language Summarization of Program Code via Leveraging Multiple Input Representations
    Chen, Fuxiang
    Kim, Mijung
    Choo, Jaegul
    [J]. Findings of the Association for Computational Linguistics, Findings of ACL: EMNLP 2021, 2021, : 2510 - 2520