Weak Labelling for File-level Source Code Classification

Cited by: 2
Authors
Sas, Cezar [1]
Capiluppi, Andrea [1]
Affiliations
[1] Univ Groningen, Bernoulli Inst, Groningen, Netherlands
Keywords
software classification; software categories keywords; semantic reverse engineering
DOI
10.1109/SANER56733.2023.00074
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Software repository hosting services contain large amounts of open-source software, with GitHub hosting over 200 million repositories, from new to established ones. However, these repositories are not easy to find, calling for various attempts to classify their application domains automatically. Most proposed approaches, however, use artifacts, like README files, as a proxy for the project, losing the information in the source code and the interaction between files. Furthermore, they all focus on the project level, ignoring the decomposition of software projects into components and modules. This work presents a weak labelling approach based on keyword extraction to annotate source files in a software project. Our findings suggest that using keywords to perform file-level annotations is an effective approach that can capture enough information from the source file so that new labels can be predicted. The long-term goal of our research is to classify source code files and use these annotations to identify semantic components in software projects. In addition, these annotations can be used for semantic reverse engineering, software reuse, and more. We plan to train machine learning models that use our proposed weak supervision to better annotate source files inside software projects.
Pages: 698-702
Page count: 5