Weak Labelling for File-level Source Code Classification

Cited by: 2
Authors
Sas, Cezar [1]
Capiluppi, Andrea [1]
Affiliations
[1] Univ Groningen, Bernoulli Inst, Groningen, Netherlands
Keywords
software classification; software categories keywords; semantic reverse engineering
DOI
10.1109/SANER56733.2023.00074
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Software repository hosting services contain large amounts of open-source software, with GitHub hosting over 200 million repositories, from new to established ones. However, these repositories are not easy to find, which has prompted various attempts to classify their application domains automatically. Most proposed approaches use artifacts, such as README files, as a proxy for the project, losing the information in the source code and the interaction between files. Furthermore, they all focus on the project level, ignoring the decomposition of software projects into components and modules. This work presents a weak labelling approach based on keyword extraction to annotate source files in a software project. Our findings suggest that using keywords to perform file-level annotations is an effective approach that can capture enough information from the source file so that new labels can be predicted. The long-term goal of our research is to classify source code files and use these annotations to identify semantic components in software projects. In addition, these annotations can be used for semantic reverse engineering, software reuse, and more. We plan to train machine learning models that use our proposed weak supervision to better annotate source files inside software projects.
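As a rough illustration of what keyword-based weak labelling of a single source file can look like, the Python sketch below counts domain keywords among the identifiers in a file and assigns the best-scoring domain. It is not the authors' method: the keyword lists, the `min_hits` threshold, and the function names `tokenize` and `weak_label` are all assumptions made for the example.

```python
import re
from collections import Counter
from typing import List, Optional

# Hypothetical keyword lists per application domain (not taken from the paper).
DOMAIN_KEYWORDS = {
    "networking": {"socket", "http", "request", "port"},
    "database": {"sql", "query", "table", "cursor"},
}


def tokenize(source: str) -> List[str]:
    """Split source text into lowercase words, breaking snake_case and camelCase."""
    tokens = []
    for word in re.findall(r"[A-Za-z]+", source):
        # "openSocket" -> ["open", "Socket"]; "HTTPRequest" -> ["HTTP", "Request"]
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word))
    return [t.lower() for t in tokens]


def weak_label(source: str, min_hits: int = 2) -> Optional[str]:
    """Return the domain with the most keyword hits, or None if evidence is too thin."""
    counts = Counter(tokenize(source))
    scores = {
        domain: sum(counts[k] for k in keywords)
        for domain, keywords in DOMAIN_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Abstain instead of guessing when too few keywords are present.
    return best if scores[best] >= min_hits else None


example = "def openSocket(port):\n    return HTTPRequest(port)"
label = weak_label(example)
```

Labels produced this way are noisy by design; in a weak supervision setting they would serve as training signal for a downstream classifier rather than as ground truth.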
Pages: 698-702
Page count: 5