Curating GitHub for engineered software projects

Cited by: 1
Authors
Nuthan Munaiah
Steven Kroh
Craig Cabrey
Meiyappan Nagappan
Affiliations
[1] Rochester Institute of Technology, Department of Software Engineering
[2] University of Waterloo, David R. Cheriton School of Computer Science
Source
Empirical Software Engineering, 2017, 22 (06)
Keywords
Mining software repositories; GitHub; Data curation; Curation tools
DOI
Not available
Abstract
Software forges like GitHub host millions of repositories. Software engineering researchers have been able to take advantage of such a large corpus of potential study subjects with the help of tools like GHTorrent and Boa. However, the simplicity in querying comes with a caveat: there are limited means of separating the signal (e.g. repositories containing engineered software projects) from the noise (e.g. repositories containing homework assignments). The proportion of noise in a random sample of repositories could skew the study and may lead researchers to unrealistic, potentially inaccurate, conclusions. We argue that it is imperative to have the ability to sieve out the noise in such large repository forges. We propose a framework, and present a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. We identify software engineering practices (called dimensions) and propose means for validating their existence in a GitHub repository. We used reaper to measure the dimensions of 1,857,423 GitHub repositories. We then used manually classified data sets of repositories to train classifiers capable of predicting whether a given GitHub repository contains an engineered software project. The performance of the classifiers was evaluated using a set of 200 repositories with known ground truth classification. We also compared the performance of the classifiers to other approaches to classification (e.g. number of GitHub Stargazers) and found our classifiers to outperform existing approaches. We found the stargazers-based classifier (with 10 as the threshold for the number of stargazers) to exhibit high precision (97%) but low recall (32%). On the other hand, our best classifier exhibited both high precision (82%) and high recall (86%).
The stargazers-based criterion offers precision but fails to recall a significant portion of the population.
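The precision and recall figures quoted in the abstract can be illustrated with a minimal Python sketch of a stargazers-based baseline. This is not reaper's implementation: the `Repo` type, the function names, and the sample data are all hypothetical, and reading the threshold as "at least 10 stargazers" is an assumption about how the cut-off is applied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Repo:
    name: str         # hypothetical repository identifier
    stargazers: int   # number of GitHub stargazers
    engineered: bool  # manually assigned ground-truth label


def stargazer_classifier(repo: Repo, threshold: int = 10) -> bool:
    """Predict 'engineered' when the stargazer count meets the threshold.

    Assumption: the threshold is inclusive (>= 10).
    """
    return repo.stargazers >= threshold


def precision_recall(
    repos: Iterable[Repo], predict: Callable[[Repo], bool]
) -> Tuple[float, float]:
    """Return (precision, recall) of `predict` against the ground truth."""
    tp = fp = fn = 0
    for repo in repos:
        predicted = predict(repo)
        if predicted and repo.engineered:
            tp += 1   # correctly flagged as engineered
        elif predicted and not repo.engineered:
            fp += 1   # flagged, but not engineered
        elif not predicted and repo.engineered:
            fn += 1   # engineered project that was missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up sample: several engineered projects have few stargazers.
SAMPLE = [
    Repo("a", 250, True),
    Repo("b", 40, True),
    Repo("c", 7, True),
    Repo("d", 3, True),
    Repo("e", 0, True),
    Repo("f", 1, False),
    Repo("g", 0, False),
]

if __name__ == "__main__":
    p, r = precision_recall(SAMPLE, stargazer_classifier)
    print(f"precision={p:.2f} recall={r:.2f}")
```

On this toy sample every repository the baseline accepts is engineered, yet three of the five engineered repositories fall below the threshold and are missed. That is the same qualitative pattern the abstract reports for the stargazers-based classifier: high precision, low recall.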
Pages: 3219 - 3253
Page count: 34
Related papers
50 in total
  • [1] Curating GitHub for engineered software projects
    Munaiah, Nuthan
    Kroh, Steven
    Cabrey, Craig
    Nagappan, Meiyappan
    EMPIRICAL SOFTWARE ENGINEERING, 2017, 22 (06) : 3219 - 3253
  • [2] PHANTOM: Curating GitHub for engineered software projects using time-series clustering
    Peter Pickerill
    Heiko Joshua Jungen
    Mirosław Ochodek
    Michał Maćkowiak
    Miroslaw Staron
    Empirical Software Engineering, 2020, 25 : 2897 - 2929
  • [3] PHANTOM: Curating GitHub for engineered software projects using time-series clustering
    Pickerill, Peter
    Jungen, Heiko Joshua
    Ochodek, Miroslaw
    Mackowiak, Michal
    Staron, Miroslaw
    EMPIRICAL SOFTWARE ENGINEERING, 2020, 25 (04) : 2897 - 2929
  • [4] Wasmizer: Curating WebAssembly-driven Projects on GitHub
    Nicholson, Alexander
    Stievenart, Quentin
    Mazidi, Arash
    Ghafari, Mohammad
    2023 IEEE/ACM 20TH INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2023, : 130 - 141
  • [5] REPERSP: Recommending Personalized Software Projects on GitHub
    Xu, Wenyuan
    Sun, Xiaobing
    Hu, Jiajun
    Li, Bin
    2017 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE AND EVOLUTION (ICSME), 2017, : 648 - 652
  • [6] Evolution Model of Open-Source Software Projects in GitHub
    Wang, Hongbing
    Ji, Haoran
    2022 2ND IEEE INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING AND ARTIFICIAL INTELLIGENCE (SEAI 2022), 2022, : 135 - 145
  • [7] Code of Conduct Conversations in Open Source Software Projects on Github
    Li, Renee
    Pandurangan, Pavitthra
    Frluckaj, Hana
    Dabbish, Laura
    Proceedings of the ACM on Human-Computer Interaction, 2021, 5 (CSCW1)
  • [8] Social Diversity and Growth Levels of Open Source Software Projects on GitHub
    Aue, Joop
    Haisma, Michiel
    Tomasdottir, Kristin Fjola
    Bacchelli, Alberto
    ESEM'16: PROCEEDINGS OF THE 10TH ACM/IEEE INTERNATIONAL SYMPOSIUM ON EMPIRICAL SOFTWARE ENGINEERING AND MEASUREMENT, 2016,
  • [9] Understanding Language Selection in Multi-Language Software Projects on GitHub
    Li, Wen
    Meng, Na
    Li, Li
    Cai, Haipeng
    2021 IEEE/ACM 43RD INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING: COMPANION PROCEEDINGS (ICSE-COMPANION 2021), 2021, : 256 - 257
  • [10] GitHub Projects. Quality Analysis of Open-Source Software
    Jarczyk, Oskar
    Gruszka, Blazej
    Jaroszewicz, Szymon
    Bukowski, Leszek
    Wierzbicki, Adam
    SOCIAL INFORMATICS, SOCINFO 2014, 2014, 8851 : 80 - 94