Semantic topic models for source code analysis

Cited by: 11
Authors
Mahmoud, Anas [1]
Bradshaw, Gary [2]
Affiliations
[1] Louisiana State Univ, Div Comp Sci & Engn, Baton Rouge, LA 70803 USA
[2] Mississippi State Univ, Dept Psychol, Mississippi State, MS 39762 USA
Keywords
Clustering; Information theory; Topic modeling; LATENT; SOFTWARE
DOI
10.1007/s10664-016-9473-1
Chinese Library Classification number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
Topic modeling techniques have been recently applied to analyze and model source code. Such techniques exploit the textual content of source code to provide automated support for several basic software engineering activities. Despite these advances, applications of topic modeling in software engineering are frequently suboptimal. This can be attributed to the fact that current state-of-the-art topic modeling techniques tend to be data intensive. However, the textual content of source code, embedded in its identifiers, comments, and string literals, tends to be sparse in nature. This prevents classical topic modeling techniques, typically used to model natural language texts, from generating proper models when applied to source code. Furthermore, the operational complexity and multi-parameter calibration often associated with conventional topic modeling techniques raise important concerns about their feasibility as data analysis models in software engineering. Motivated by these observations, in this paper we propose a novel approach for topic modeling designed for source code. The proposed approach exploits the basic assumptions of the cluster hypothesis and information theory to discover semantically coherent topics in software systems. Ten software systems from different application domains are used to empirically calibrate and configure the proposed approach. The usefulness of generated topics is empirically validated using human judgment. Furthermore, a case study that demonstrates the operation of the proposed approach in analyzing code evolution is reported. The results show that our approach produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the necessity for extensive parameter calibration.
Pages: 1965-2000
Page count: 36