Semantic clustering: Identifying topics in source code

Cited by: 268
Authors
Kuhn, Adrian [1]
Ducasse, Stephane
Girba, Tudor
Affiliations
[1] Univ Bern, Software Composit Grp, CH-3012 Bern, Switzerland
[2] Univ Savoie, LISTIC, Language & Software Evolut Grp, F-73011 Chambery, France
Keywords
reverse engineering; clustering; latent semantic indexing; visualization
DOI
10.1016/j.infsof.2006.10.017
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Many existing approaches in Software Comprehension focus on program structure or external documentation. However, analyses that consider only this formal information overlook the informal semantics contained in the vocabulary of source code. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the naming used in the code. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments. We introduce Semantic Clustering, a technique based on Latent Semantic Indexing and clustering that groups source artifacts using similar vocabulary. We call these groups semantic clusters and interpret them as linguistic topics that reveal the intention of the code. We compare the topics to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system. Our approach is language independent, as it works at the level of identifier names. To validate our approach, we applied it to several case studies, two of which we present in this paper. Note: Some of the visualizations presented make heavy use of colors. Please obtain a color copy of the article for better understanding. © 2006 Elsevier B.V. All rights reserved.
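The pipeline described in the abstract (split identifiers into terms, project artifacts into a low-rank LSI space, cluster by vocabulary similarity, label each cluster) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the toy artifacts, all function names, the raw term counts (no weighting), the power-iteration eigen-solver, the average-linkage clustering, and the most-frequent-term labels are assumptions made here for the sake of a small, dependency-free example.

```python
import math
import re
from collections import Counter

# Hypothetical toy "source artifacts": class name -> identifiers found in it.
ARTIFACTS = {
    "Account":   ["openAccount", "closeAccount", "accountBalance"],
    "Statement": ["printStatement", "accountBalance", "statement_total"],
    "Circle":    ["drawCircle", "circleRadius", "shape_area"],
    "Canvas":    ["drawShape", "canvasWidth", "render_shape"],
}

def split_identifier(name):
    """Split camelCase / snake_case identifiers into lowercase terms."""
    return [t.lower() for t in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)]

def term_counts(identifiers):
    """Bag of terms for one artifact (raw counts; an assumed simplification)."""
    return Counter(t for name in identifiers for t in split_identifier(name))

def gram_matrix(docs):
    """Document-by-document dot products of the term-count vectors (A^T A)."""
    return [[float(sum(a[t] * b[t] for t in a)) for b in docs] for a in docs]

def top_eigenpairs(matrix, k, iterations=500):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration + deflation.
    Caveat: the fixed all-ones start vector must not be orthogonal to the
    wanted eigenvector; that holds for the toy data, not in general."""
    n = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [1.0] * n
        for _ in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0.0:
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the eigenvalue for the converged unit vector.
        value = sum(vec[i] * sum(work[i][j] * vec[j] for j in range(n))
                    for i in range(n))
        pairs.append((value, vec))
        # Deflate: remove the component just found.
        for i in range(n):
            for j in range(n):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs

def lsi_document_vectors(docs, k):
    """Rank-k LSI coordinates of each document: sigma_i * v_i[j], where
    sigma_i^2 and v_i are eigenpairs of A^T A."""
    pairs = top_eigenpairs(gram_matrix(docs), k)
    return [[math.sqrt(max(val, 0.0)) * vec[j] for val, vec in pairs]
            for j in range(len(docs))]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else sum(a * b for a, b in zip(u, v)) / (nu * nv)

def cluster(vectors, n_clusters):
    """Average-linkage agglomerative clustering on cosine similarity."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)
    return clusters

names = list(ARTIFACTS)
docs = [term_counts(ARTIFACTS[n]) for n in names]
vectors = lsi_document_vectors(docs, k=2)
clusters = cluster(vectors, n_clusters=2)
# Label each cluster with its most frequent term (a stand-in for real labeling).
labels = [sum((docs[i] for i in members), Counter()).most_common(1)[0][0]
          for members in clusters]
for label, members in zip(labels, clusters):
    print(label, [names[i] for i in members])
```

On this toy input the two account-related artifacts end up in one cluster and the two drawing-related artifacts in the other, because the groups share no vocabulary. A real system would need a proper SVD routine, term weighting, and a principled choice of rank and cluster count.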
Pages: 230 - 243
Page count: 14
Related papers
50 in total
  • [1] Clustering Source Code Elements by Semantic Similarity Using Wikipedia
    Schindler, Mirco
    Fox, Oliver
    Rausch, Andreas
    [J]. 2015 IEEE/ACM FOURTH INTERNATIONAL WORKSHOP ON REALIZING ARTIFICIAL INTELLIGENCE SYNERGIES IN SOFTWARE ENGINEERING (RAISE 2015), 2015, : 13 - 18
  • [2] Identifying software decompositions by applying transaction clustering on source code
    Sindhgatta, Renuka
    Pooloth, Krishnakumar
    [J]. COMPSAC 2007: THE THIRTY-FIRST ANNUAL INTERNATIONAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE, VOL I, PROCEEDINGS, 2007, : 317 - 324
  • [3] Identifying Semantic Outliers of Source Code Artifacts and Their Application to Software Architecture Recovery
    Lee, Ki-Seong
    Lee, Chan-Gun
    [J]. IEEE ACCESS, 2020, 8 (08) : 212467 - 212477
  • [4] Scalable text semantic clustering around topics
    Brena, Ramon
    Ramirez, Eduardo
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2019, 36 (05) : 4645 - 4657
  • [5] Semantic Robustness of Models of Source Code
    Henkel, Jordan
    Ramakrishnan, Goutham
    Wang, Zi
    Albarghouthi, Aws
    Jha, Somesh
    Reps, Thomas
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ANALYSIS, EVOLUTION AND REENGINEERING (SANER 2022), 2022, : 526 - 537
  • [6] Code Comments: A Way of Identifying Similarities in the Source Code
    Folea, Rares
    Slusanschi, Emil
    [J]. MATHEMATICS, 2024, 12 (07)
  • [7] Estimating Semantic Relatedness in Source Code
    Mahmoud, Anas
    Bradshaw, Gary
    [J]. ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2015, 25 (01)
  • [8] Extraction of Code-mixed Aspect Topics in Semantic Representation
    Asnani, Kavita
    Pawar, Jyoti D.
    [J]. COMPUTACION Y SISTEMAS, 2018, 22 (01) : 55 - 63
  • [9] Identifying parasitic malware as outliers by code clustering
    Li, Hongcheng
    Huang, Jianjun
    Liang, Bin
    Shi, Wenchang
    Wu, Yifang
    Bai, Shilei
    [J]. JOURNAL OF COMPUTER SECURITY, 2020, 28 (02) : 157 - 189
  • [10] Identifying use cases in source code
    Zhang, Lu
    Qin, Tao
    Zhou, Zhiying
    Hao, Dan
    Sun, Jiasu
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2006, 79 (11) : 1588 - 1598