Statistical Unigram Analysis for Source Code Repository

Cited by: 12
Authors
Xu, Weifeng [1]
Xu, Dianxiang [1]
El Ariss, Omar [2]
Liu, Yunkai [3]
Alatawi, Abdulrahman [1]
Affiliations
[1] Bowie State Univ, Dept Comp Sci, Bowie, MD 20715 USA
[2] Penn State Univ Harrisburg, Dept Comp Sci, Middletown, PA USA
[3] Gannon Univ, Dept Comp & Informat Sci, Erie, PA USA
Keywords
programming language; source code; n-gram; unigram; abbreviations; ultra-large-scale analysis
DOI
10.1109/BigMM.2017.13
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The unigram is the fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical patterns regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. Our study describes a probabilistic model for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. It shows that unigrams collected from source code repositories are essential resources for solving domain-specific problems.
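The abstract does not spell out the probabilistic model, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that unigrams are obtained by splitting identifiers on camelCase and snake_case boundaries, that a candidate expansion must start with the same letter as the abbreviation and contain its letters in order, and that candidates are ranked by relative unigram frequency. The function names (`split_identifier`, `count_unigrams`, `expand`) and the toy identifier list are hypothetical.

```python
import re
from collections import Counter


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase unigrams."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)
    return [p.lower() for p in parts]


def count_unigrams(identifiers):
    """Count how often each unigram occurs across all identifiers."""
    counts = Counter()
    for name in identifiers:
        counts.update(split_identifier(name))
    return counts


def is_subsequence(abbr, word):
    """Return True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def expand(abbr, counts, top=3):
    """Rank candidate expansions of abbr by relative unigram frequency.

    Assumption (not from the paper): a candidate is any longer unigram that
    shares the first letter of abbr and contains abbr as a subsequence.
    """
    abbr = abbr.lower()
    if not abbr:
        return []
    candidates = {
        word: n
        for word, n in counts.items()
        if len(word) > len(abbr)
        and word[0] == abbr[0]
        and is_subsequence(abbr, word)
    }
    total = sum(candidates.values())
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(word, n / total) for word, n in ranked[:top]]


# Toy corpus of identifiers, invented for illustration only.
identifiers = [
    "getMessage", "sendMessage", "message_count", "parseMessage",
    "massage_data", "maxLength", "buttonCount", "msg", "btn",
]
counts = count_unigrams(identifiers)
ranking = expand("msg", counts)
print(ranking)
```

On this toy corpus, "message" outranks "massage" for the abbreviation "msg" simply because it occurs more often. A model built from a real repository would likely need further evidence, such as the surrounding context, to separate candidates with similar frequencies.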
Pages: 1-8
Page count: 8