Using Latent Dirichlet Allocation for Automatic Categorization of Software

Cited by: 100
Authors
Tian, Kai [1]
Revelle, Meghan [1]
Poshyvanyk, Denys [1]
Affiliation
[1] Coll William & Mary, Dept Comp Sci, Williamsburg, VA 23185 USA
DOI
10.1109/MSR.2009.5069496
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
In this paper, we propose a technique called LACT for automatically categorizing software systems in open-source repositories. LACT is based on Latent Dirichlet Allocation, an information retrieval method which is used to index and analyze source code documents as mixtures of probabilistic topics. For an initial evaluation, we performed two studies. In the first study, LACT was compared against an existing tool, MUDABlue, for classifying 41 software systems written in C into problem domain categories. The results indicate that LACT can automatically produce meaningful category names and yield classification results comparable to MUDABlue. In the second study, we applied LACT to 43 software systems written in different programming languages such as C/C++, Java, C#, PHP, and Perl. The results indicate that LACT can be used effectively for the automatic categorization of software systems regardless of the underlying programming language or paradigm. Moreover, both studies indicate that LACT can identify several new categories that are based on libraries, architectures, or programming languages, which is a promising improvement as compared to manual categorization and existing techniques.
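The abstract describes treating each software system as a mixture of probabilistic topics and using those topics as categories. The following is a minimal toy sketch of that general idea, not LACT itself: the abstract does not specify the preprocessing, the inference method, or the parameters used, so the identifier splitting, the collapsed Gibbs sampler, the hyperparameters, and every name below (`tokenize`, `lda_gibbs`, `top_words`, the four example systems) are assumptions made for illustration.

```python
import random
import re


def tokenize(source):
    """Split source text into lowercase words, breaking camelCase identifiers.

    Hypothetical preprocessing; the abstract does not describe LACT's own.
    """
    tokens = []
    for part in re.findall(r"[A-Za-z]+", source):
        pieces = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", part)
        tokens.extend(p.lower() for p in pieces)
    return [t for t in tokens if len(t) > 2]


def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (assumed inference method).

    Returns per-document topic proportions, topic-word counts, and the vocabulary.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * v for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + v * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    # Smoothed topic proportions per document; each row sums to 1.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


def top_words(topic_word, vocab, topic, n=3):
    """Return the n most frequent words of a topic, usable as a rough label."""
    row = topic_word[topic]
    order = sorted(range(len(vocab)), key=lambda i: -row[i])
    return [vocab[i] for i in order[:n]]


# Toy "software systems": made-up identifier text, not data from the paper.
systems = {
    "chat_client": "openSocket sendMessage receiveMessage socket connect chatRoom message server",
    "irc_bot": "socket connect server joinChannel sendMessage message channel",
    "img_viewer": "loadImage decodePixel renderImage pixel image zoom color",
    "photo_edit": "image pixel color filterImage cropImage renderImage brush",
}
NUM_TOPICS = 2
docs = [tokenize(text) for text in systems.values()]
theta, topic_word, vocab = lda_gibbs(docs, NUM_TOPICS)

# Assign each system to its dominant topic and label that topic by its top words.
categories = {
    name: max(range(NUM_TOPICS), key=lambda k: row[k])
    for name, row in zip(systems, theta)
}
for name, topic in categories.items():
    print(name, "->", topic, top_words(topic_word, vocab, topic))
```

Assigning each system only to its single dominant topic is a simplification chosen for brevity; because the model yields a full topic mixture per system, a system could equally be placed in every category whose proportion exceeds some threshold.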
Pages: 163-166
Page count: 4