Software provenance tracking at the scale of public source code

Cited by: 15
Authors
Rousseau, Guillaume [1]
Di Cosmo, Roberto [1, 2]
Zacchiroli, Stefano [1, 2]
Affiliations
[1] Univ Paris, Paris, France
[2] INRIA, Paris, France
Keywords
Software evolution; Open source; Clone detection; Source code tracking; Mining software repositories; Provenance tracking
DOI
10.1007/s10664-020-09828-5
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
We study the possibilities for tracking the provenance of software source code artifacts within the largest publicly accessible corpus of source code, the Software Heritage archive, with over 4 billion unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how often the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implications of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e., never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
Pages: 2930-2959
Page count: 30
Related papers
50 in total
  • [41] The Impact of Structural Source Code Changes on Software Quality
    Gerlec, Crt
    Hericko, Marjan
    NUMERICAL ANALYSIS AND APPLIED MATHEMATICS (ICNAAM 2012), VOLS A AND B, 2012, 1479 : 470 - 473
  • [42] From Source Code Analysis to Static Software Testing
    Wang Wei
    Han Lilong
    Meng Yunxiu
    Bai He
    PROCEEDINGS OF 2014 IEEE WORKSHOP ON ADVANCED RESEARCH AND TECHNOLOGY IN INDUSTRY APPLICATIONS (WARTIA), 2014, : 1280 - 1283
  • [43] Software visualizations for improving and measuring the comprehensibility of source code
    Umphress, DA
    Hendrix, TD
    Cross, JH
    Maghsoodloo, S
    SCIENCE OF COMPUTER PROGRAMMING, 2006, 60 (02) : 121 - 133
  • [44] An effective source code review process for embedded software
    Hirayama, Masayuki
    Ohno, Katsumi
    Kawai, Nao
    Tamaru, Kichiro
    Monden, Hiroshi
    PRODUCT-FOCUSED SOFTWARE PROCESS IMPROVEMENT, PROCEEDINGS, 2006, 4034 : 47 - 60
  • [45] Code Forking, Governance, and Sustainability in Open Source Software
    Nyman, Linus
    Lindman, Juho
    TECHNOLOGY INNOVATION MANAGEMENT REVIEW, 2013, : 7 - 12
  • [46] Copyrights Expression and Secure Container of Software Source Code
    Cha, ByungRae
    Park, Sun
    NCM 2008: 4TH INTERNATIONAL CONFERENCE ON NETWORKED COMPUTING AND ADVANCED INFORMATION MANAGEMENT, VOL 2, PROCEEDINGS, 2008, : 325 - 332
  • [47] Software system comparison with semantic source code embeddings
    Sašo Karakatič
    Aleksej Miloševič
    Tjaša Heričko
    Empirical Software Engineering, 2022, 27
  • [48] Source code transformation based on software cost analysis
    Chung, EY
    Benini, L
    De Micheli, G
    ISSS'01: 14TH INTERNATIONAL SYMPOSIUM ON SYSTEM SYNTHESIS, 2001, : 153 - 158
  • [49] Approach to Searching Software Source Code with Graph Embedding
    Ling C.-Y.
    Zou Y.-Z.
    Lin Z.-Q.
    Xie B.
    Zhao J.-F.
    Ruan Jian Xue Bao/Journal of Software, 2019, 30 (05) : 1481 - 1497
  • [50] SCOBA: Source Code Based Attestation on Custom Software
    Gu, Liang
    Guo, Yao
    Ruan, Anbang
    Shen, Qingni
    Mei, Hong
    26TH ANNUAL COMPUTER SECURITY APPLICATIONS CONFERENCE (ACSAC 2010), 2010, : 337 - 346