Semantic topic models for source code analysis

Cited by: 11
Authors
Mahmoud, Anas [1]
Bradshaw, Gary [2]
Affiliations
[1] Louisiana State Univ, Div Comp Sci & Engn, Baton Rouge, LA 70803 USA
[2] Mississippi State Univ, Dept Psychol, Mississippi State, MS 39762 USA
Keywords
Clustering; Information theory; Topic modeling; LATENT; SOFTWARE
DOI
10.1007/s10664-016-9473-1
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Topic modeling techniques have recently been applied to analyze and model source code. Such techniques exploit the textual content of source code to provide automated support for several basic software engineering activities. Despite these advances, applications of topic modeling in software engineering are frequently suboptimal. This can be attributed to the fact that current state-of-the-art topic modeling techniques tend to be data intensive. However, the textual content of source code, embedded in its identifiers, comments, and string literals, tends to be sparse in nature. This prevents classical topic modeling techniques, typically used to model natural language texts, from generating proper models when applied to source code. Furthermore, the operational complexity and multi-parameter calibration often associated with conventional topic modeling techniques raise important concerns about their feasibility as data analysis models in software engineering. Motivated by these observations, in this paper we propose a novel approach for topic modeling designed for source code. The proposed approach exploits the basic assumptions of the cluster hypothesis and information theory to discover semantically coherent topics in software systems. Ten software systems from different application domains are used to empirically calibrate and configure the proposed approach. The usefulness of the generated topics is empirically validated using human judgment. Furthermore, a case study that demonstrates the operation of the proposed approach in analyzing code evolution is reported. The results show that our approach produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the necessity for extensive parameter calibration.
Pages: 1965-2000
Page count: 36
Journal: Empirical Software Engineering, 2017, 22: 1965-2000