Leveraging a Corpus of Natural Language Descriptions for Program Similarity

Cited by: 11

Authors:

Zilberstein, Meital [1]

Yahav, Eran [1]

Affiliations:

[1] Technion, Haifa, Israel

Source:

ONWARD!'16: PROCEEDINGS OF THE 2016 ACM INTERNATIONAL SYMPOSIUM ON NEW IDEAS, NEW PARADIGMS, AND REFLECTIONS ON PROGRAMMING AND SOFTWARE | 2016

Keywords:

Code Similarity; Natural Language; Program Analysis; Semantics;

DOI:

10.1145/2986012.2986013

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

Program similarity is a central challenge in many programming-related applications, such as code search, clone detection, automatic translation, and programming education. We present a novel approach for establishing the similarity of code fragments by: (i) obtaining textual descriptions of code fragments captured in millions of posts on question-answering sites, blogs, and other sources, and (ii) using natural language processing techniques to establish similarity between textual descriptions, and thus between their corresponding code fragments. To improve precision, we use a simple static analysis that extracts type signatures, and combine the results of textual similarity with similarity of the signatures. Because our notion of code similarity is based on similarity of textual descriptions, our approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches. To evaluate our approach, we use data obtained from the popular question-answering site, STACK-OVERFLOW. To obtain a ground truth to compare against, we developed a crowdsourcing system, LIKE2DROPS, that allows users to label the similarity of code fragments. We used the system to collect similarity classifications for a massive corpus of 6,500 program pairs. Our results show that our technique is effective in determining similarity, and achieves more than 85% precision, recall, and accuracy.
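The combination described in the abstract, a textual similarity score over descriptions merged with a similarity score over type signatures, can be illustrated with a minimal sketch. The abstract does not say which natural language measure is used, how signatures are compared, or how the two scores are combined, so everything below is an assumption made for illustration: bag-of-words cosine similarity for the descriptions, Jaccard overlap for the signature types, a weighted average with a hypothetical `text_weight` parameter, and all function names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-frequency vectors.

    Assumption: a stand-in for the unspecified NLP similarity measure.
    """
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(types_a, types_b):
    """Jaccard overlap between two collections of type names.

    Assumption: a stand-in for the unspecified signature comparison.
    """
    set_a, set_b = set(types_a), set(types_b)
    if not set_a and not set_b:
        return 1.0  # two empty signatures are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def combined_similarity(desc_a, sig_a, desc_b, sig_b, text_weight=0.7):
    """Weighted average of description similarity and signature similarity.

    The weight 0.7 is an arbitrary illustrative default, not a reported value.
    """
    text_score = cosine_similarity(desc_a, desc_b)
    sig_score = jaccard_similarity(sig_a, sig_b)
    return text_weight * text_score + (1 - text_weight) * sig_score


# Hypothetical fragments: descriptions paired with the types in their signatures.
score = combined_similarity(
    "How to reverse a string in Java", ["String"],
    "Reverse a string using Python", ["str"],
)
print(f"combined similarity: {score:.2f}")
```

Because the text score depends only on the descriptions, the two fragments above can be compared even though they come from different languages; the signature score then acts as a second signal, which in this sketch penalizes the pair because `String` and `str` are treated as unrelated names.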

Pages: 197-211

Page count: 15
