Weak Labelling for File-level Source Code Classification

Cited by: 2
Authors
Sas, Cezar [1]
Capiluppi, Andrea [1]
Affiliations
[1] Univ Groningen, Bernoulli Inst, Groningen, Netherlands
Keywords
software classification; software categories keywords; semantic reverse engineering
DOI
10.1109/SANER56733.2023.00074
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
Software repository hosting services contain large amounts of open-source software, with GitHub hosting over 200 million repositories, from new to established ones. However, these repositories are not easy to find, which has prompted various attempts to classify their application domains automatically. Most proposed approaches use artifacts, like README files, as a proxy for the project, losing the information in the source code and the interaction between files. Furthermore, they all focus on the project level, ignoring the decomposition of software projects into components and modules. This work presents a weak labelling approach based on keyword extraction to annotate source files in a software project. Our findings suggest that using keywords to perform file-level annotations is an effective approach that can capture enough information from the source file so that new labels can be predicted. The long-term goal of our research is to classify source code files and use these annotations to identify semantic components in software projects. In addition, these annotations can be used for semantic reverse engineering, software reuse, and more. We plan to train machine learning models that use our proposed weak supervision to better annotate source files inside software projects.
Pages: 698-702
Page count: 5