Semantic clustering: Identifying topics in source code

Cited by: 268

Authors:

Kuhn, Adrian [1]

Ducasse, Stephane

Girba, Tudor

Affiliations:

[1] Univ Bern, Software Composit Grp, CH-3012 Bern, Switzerland

[2] Univ Savoie, LISTIC, Language & Software Evolut Grp, F-73011 Chambery, France

Source:

INFORMATION AND SOFTWARE TECHNOLOGY | 2007 / Vol. 49 / Issue 03

Keywords:

reverse engineering; clustering; latent semantic indexing; visualization

DOI:

10.1016/j.infsof.2006.10.017

Chinese Library Classification number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing formal information, the informal semantics contained in the vocabulary of source code are overlooked. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the code naming. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments. We introduce Semantic Clustering, a technique based on Latent Semantic Indexing and clustering to group source artifacts that use similar vocabulary. We call these groups semantic clusters and we interpret them as linguistic topics that reveal the intention of the code. We compare the topics to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system. Our approach is language independent as it works at the level of identifier names. To validate our approach, we applied it to several case studies, two of which we present in this paper. Note: Some of the visualizations presented make heavy use of colors. Please obtain a color copy of the article for better understanding. © 2006 Elsevier B.V. All rights reserved.
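
To make the idea of grouping source artifacts by shared vocabulary concrete, here is a minimal Python sketch. It is not the authors' implementation: it only splits identifiers into terms, builds raw term-frequency vectors, and groups artifacts by cosine similarity. The Latent Semantic Indexing step named in the abstract (a reduced-rank decomposition of the term-document matrix) is deliberately left out, as are stop-word filtering and term weighting. All function names, the sample artifacts, and the 0.5 threshold are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def term_vector(source_text):
    """Count the terms of every identifier-like token in one artifact."""
    counts = Counter()
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_text):
        counts.update(split_identifier(token))
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_by_similarity(vectors, threshold=0.5):
    """Greedy grouping: join the first group whose first member is similar enough.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    groups = []
    for name, vec in vectors.items():
        for group in groups:
            if cosine(vectors[group[0]], vec) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical artifacts: the text stands in for the identifiers found in each class.
artifacts = {
    "AccountService": "openAccount closeAccount account_balance",
    "AccountReport": "printAccount accountBalance formatBalance",
    "XmlParser": "parseXmlNode readXmlFile xml_node",
}
vectors = {name: term_vector(text) for name, text in artifacts.items()}
groups = group_by_similarity(vectors)
print(groups)
```

With these sample inputs the two account-related artifacts share the terms "account" and "balance" and land in one group, while the XML artifact shares no terms with them and forms its own group. A fuller treatment would replace the raw term counts with vectors from a reduced-rank decomposition before comparing them.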

Pages: 230-243

Number of pages: 14
