Software provenance tracking at the scale of public source code

Cited by: 15
Authors
Rousseau, Guillaume [1]
Di Cosmo, Roberto [1,2]
Zacchiroli, Stefano [1,2]
Affiliations
[1] Univ Paris, Paris, France
[2] INRIA, Paris, France
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
10.1007/s10664-020-09828-5
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification codes
081202; 0835
Abstract
We study the possibilities for tracking the provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how much the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e. never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
Pages: 2930 - 2959
Page count: 30
Related papers
50 in total
  • [21] A Code Provenance Management Tool for IP-Aware Software Development
    Dang, Ya Bin
    Cheng, Ping
    Luo, Lin
    Cho, Adrian
    ICSE'08 PROCEEDINGS OF THE THIRTIETH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING, 2008, : 975 - 976
  • [22] Reconciling software architecture and source code in support of software evolution
    Haitzer, Thomas
    Navarro, Elena
    Zdun, Uwe
    JOURNAL OF SYSTEMS AND SOFTWARE, 2017, 123 : 119 - 144
  • [23] Research on Network Malicious Code Detection and provenance tracking in Future Network
    Liu Lan
    Lin Jun
    Wang Qiang
    Xu Xiaoping
    2018 IEEE 18TH INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY COMPANION (QRS-C), 2018, : 264 - 268
  • [24] ENDGAME SOURCE CODE GOES PUBLIC
    STILLER, L
    ICCA JOURNAL, 1992, 15 (02): : 107 - 107
  • [25] Source Code Metrics for Software Defects Prediction
    Rebro, Dominik Arne
    Rossi, Bruno
    Chren, Stanislav
    38TH ANNUAL ACM SYMPOSIUM ON APPLIED COMPUTING, SAC 2023, 2023, : 1469 - 1472
  • [26] A Framework of Code Reuse in Open Source Software
    Tung, Yuan-Hsin
    Chuang, Chih-Ju
    Shan, Hwai-Ling
    2014 16TH ASIA-PACIFIC NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM (APNOMS), 2014,
  • [27] Archiving and Referencing Source Code with Software Heritage
    Di Cosmo, Roberto
    MATHEMATICAL SOFTWARE - ICMS 2020, 2020, 12097 : 362 - 373
  • [28] Software model checking without source code
    Chaki, Sagar
    Ivers, James
    INNOVATIONS IN SYSTEMS AND SOFTWARE ENGINEERING, 2010, 6 (03) : 233 - 242
  • [29] Source Code Comprehension Analysis in Software Maintenance
    Al-Saiyd, Nedhal A.
    2017 2ND INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION SYSTEMS (ICCCS2017), 2017, : 1 - 5
  • [30] Supporting software documentation with source code summarization
    Al-Msie'deen, Ra'Fat
    Blasi, Anas H.
    INTERNATIONAL JOURNAL OF ADVANCED AND APPLIED SCIENCES, 2019, 6 (01): : 59 - 67