Software provenance tracking at the scale of public source code

Cited by: 15
Authors
Rousseau, Guillaume [1]
Di Cosmo, Roberto [1, 2]
Zacchiroli, Stefano [1, 2]
Affiliations
[1] Univ Paris, Paris, France
[2] INRIA, Paris, France
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
10.1007/s10664-020-09828-5
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
We study the possibilities for tracking the provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how much the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e. never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
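As a rough illustration of the replication factor mentioned in the abstract, the sketch below counts, for each artifact, the number of distinct contexts in which it occurs. This is an assumption-laden toy and not the authors' method: the function name `replication_factors`, the pair-based input format, and the sample identifiers are all hypothetical, and the paper's exact definition is not given in this record.

```python
from collections import defaultdict


def replication_factors(occurrences):
    """Map each artifact to the number of distinct contexts that contain it.

    `occurrences` is an iterable of (artifact, context) pairs, for example
    (file content hash, commit id). This input format is an assumption made
    for illustration only.
    """
    contexts = defaultdict(set)
    for artifact, context in occurrences:
        # A set ignores repeated sightings of the same pair.
        contexts[artifact].add(context)
    return {artifact: len(seen) for artifact, seen in contexts.items()}


# Hypothetical toy data: (file content hash, commit id) pairs.
occurrences = [
    ("blob-a", "commit-1"),
    ("blob-a", "commit-2"),
    ("blob-a", "commit-3"),
    ("blob-b", "commit-3"),
    ("blob-a", "commit-1"),  # repeated pair, counted once
]

factors = replication_factors(occurrences)
average = sum(factors.values()) / len(factors)
print(factors, average)
```

At the scale described in the abstract, an in-memory dictionary of sets would not be practical; the sketch only makes the counted quantity concrete.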
Pages: 2930-2959
Number of pages: 30