Software provenance tracking at the scale of public source code

Cited by: 15
Authors
Rousseau, Guillaume [1]
Di Cosmo, Roberto [1, 2]
Zacchiroli, Stefano [1, 2]
Affiliations
[1] Univ Paris, Paris, France
[2] INRIA, Paris, France
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
10.1007/s10664-020-09828-5
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
We study the possibilities to track provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how much the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e. never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
Pages: 2930-2959
Number of pages: 30
Related papers
50 in total
  • [1] Software provenance tracking at the scale of public source code
    Guillaume Rousseau
    Roberto Di Cosmo
    Stefano Zacchiroli
    Empirical Software Engineering, 2020, 25 : 2930 - 2959
  • [2] Malware Provenance: Code Reuse Detection in Malicious Software at Scale
    Upchurch, Jason
    Zhou, Xiaobo
    2016 11TH INTERNATIONAL CONFERENCE ON MALICIOUS AND UNWANTED SOFTWARE (MALWARE), 2016, : 101 - 109
  • [3] Provenance Tracking in the LHCb Software
    Trisovic A.
    Jones C.R.
    Couturier B.
    Clemencic M.
    Barba L.A.
    Thiruvathukal G.K.
    Computing in Science and Engineering, 2020, 22 (02): : 88 - 94
  • [4] Using the uniqueness of global identifiers to determine the provenance of Python software source code
    Yiming Sun
    Daniel German
    Stefano Zacchiroli
    Empirical Software Engineering, 2023, 28
  • [5] Using the uniqueness of global identifiers to determine the provenance of Python software source code
    Sun, Yiming
    German, Daniel
    Zacchiroli, Stefano
    EMPIRICAL SOFTWARE ENGINEERING, 2023, 28 (05)
  • [6] LHDiff: Tracking Source Code Lines To Support Software Maintenance Activities
    Asaduzzaman, Muhammad
    Roy, Chanchal K.
    Schneider, Kevin A.
    Di Penta, Massimiliano
    2013 29TH IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE (ICSM), 2013, : 484 - 487
  • [7] Automatically Adapting Source Code to Document Provenance
    Miles, Simon
    PROVENANCE AND ANNOTATION OF DATA AND PROCESSES, 2010, 6378 : 102 - 110
  • [8] Intelligent Code Review Assignment for Large Scale Open Source Software Stacks
    Aryendu, Ishan
    Wang, Ying
    Elkourdi, Farah
    AlOmar, Eman
    PROCEEDINGS OF THE 37TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING, ASE 2022, 2022,
  • [9] Task Articulation in Software Maintenance: Integrating Source Code Annotations with an Issue Tracking System
    Anvik, John
    Storey, Margaret-Anne
    2008 IEEE INTERNATIONAL CONFERENCE ON SOFTWARE MAINTENANCE, 2008, : 460 - 461
  • [10] Tracking code clones in evolving software
    Duala-Ekoko, Ekwa
    Robillard, Martin P.
    ICSE 2007: 29TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, PROCEEDINGS, 2007, : 158 - +