Semantic topic models for source code analysis

Cited by: 11
Authors
Mahmoud, Anas [1]
Bradshaw, Gary [2]
Affiliations
[1] Louisiana State Univ, Div Comp Sci & Engn, Baton Rouge, LA 70803 USA
[2] Mississippi State Univ, Dept Psychol, Mississippi State, MS 39762 USA
Keywords
Clustering; Information theory; Topic modeling; LATENT; SOFTWARE
DOI
10.1007/s10664-016-9473-1
Chinese Library Classification
TP31 [Computer Software]
Subject classification codes
081202; 0835
Abstract
Topic modeling techniques have recently been applied to analyze and model source code. Such techniques exploit the textual content of source code to provide automated support for several basic software engineering activities. Despite these advances, applications of topic modeling in software engineering are frequently suboptimal. This can be attributed to the fact that current state-of-the-art topic modeling techniques tend to be data intensive. However, the textual content of source code, embedded in its identifiers, comments, and string literals, tends to be sparse in nature. This prevents classical topic modeling techniques, typically used to model natural language texts, from generating proper models when applied to source code. Furthermore, the operational complexity and multi-parameter calibration often associated with conventional topic modeling techniques raise important concerns about their feasibility as data analysis models in software engineering. Motivated by these observations, in this paper we propose a novel approach for topic modeling designed for source code. The proposed approach exploits the basic assumptions of the cluster hypothesis and information theory to discover semantically coherent topics in software systems. Ten software systems from different application domains are used to empirically calibrate and configure the proposed approach. The usefulness of generated topics is empirically validated using human judgment. Furthermore, a case study that demonstrates the operation of the proposed approach in analyzing code evolution is reported. The results show that our approach produces stable, more interpretable, and more expressive topics than classical topic modeling techniques without the necessity for extensive parameter calibration.
Pages: 1965-2000
Page count: 36