Semantic topic models for source code analysis

Cited by: 11
Authors
Mahmoud, Anas [1]
Bradshaw, Gary [2]
Affiliations
[1] Louisiana State Univ, Div Comp Sci & Engn, Baton Rouge, LA 70803 USA
[2] Mississippi State Univ, Dept Psychol, Mississippi State, MS 39762 USA
Keywords
Clustering; Information theory; Topic modeling; LATENT; SOFTWARE
DOI
10.1007/s10664-016-9473-1
Chinese Library Classification
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Topic modeling techniques have recently been applied to analyze and model source code. Such techniques exploit the textual content of source code to provide automated support for several basic software engineering activities. Despite these advances, applications of topic modeling in software engineering are frequently suboptimal. This can be attributed to the fact that current state-of-the-art topic modeling techniques tend to be data intensive, whereas the textual content of source code, embedded in its identifiers, comments, and string literals, tends to be sparse in nature. This prevents classical topic modeling techniques, typically used to model natural language texts, from generating proper models when applied to source code. Furthermore, the operational complexity and multi-parameter calibration often associated with conventional topic modeling techniques raise important concerns about their feasibility as data analysis models in software engineering. Motivated by these observations, in this paper we propose a novel approach for topic modeling designed for source code. The proposed approach exploits the basic assumptions of the cluster hypothesis and information theory to discover semantically coherent topics in software systems. Ten software systems from different application domains are used to empirically calibrate and configure the proposed approach. The usefulness of the generated topics is empirically validated using human judgment. Furthermore, a case study that demonstrates the operation of the proposed approach in analyzing code evolution is reported. The results show that our approach produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the necessity for extensive parameter calibration.
Pages: 1965-2000
Page count: 36