It's Not Just GitHub: Identifying Data and Software Sources Included in Publications

Cited by: 2
Authors
Escamilla, Emily [1]
Salsabil, Lamia [1]
Klein, Martin [2]
Wu, Jian [1]
Weigle, Michele C. [1]
Nelson, Michael L. [1]
Affiliations
[1] Old Dominion Univ, Norfolk, VA USA
[2] Los Alamos Natl Lab, Los Alamos, NM 87544 USA
Keywords
Web Archiving; GitHub; arXiv; Digital Preservation; Memento; Open Source Software
DOI
10.1007/978-3-031-43849-3_17
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Paper publications are no longer the only form of research product. Due to recent initiatives by publication venues and funding institutions, open access datasets and software products are increasingly considered research products and URIs to these products are growing more prevalent in scholarly publications. However, as with all URIs, resources found on the live Web are not permanent. Archivists and institutions including Software Heritage, Internet Archive, and Zenodo are working to preserve data and software products as valuable parts of reproducibility, a cornerstone of scientific research. While some hosting platforms are well-known and can be identified with regular expressions, there are a vast number of smaller, more niche hosting platforms utilized by researchers to host their data and software. If it is not feasible to manually identify all hosting platforms used by researchers, how can we identify URIs to open-access data and software (OADS) to aid in their preservation? We used a hybrid classifier to classify URIs as OADS URIs and non-OADS URIs. We found that URIs to Git hosting platforms (GHPs) including GitHub, GitLab, SourceForge, and Bitbucket accounted for 33% of OADS URIs. Non-GHP OADS URIs are distributed across almost 50,000 unique hostnames. We determined that using a hybrid classifier allows for the identification of OADS URIs in less common hosting platforms which can benefit discoverability for preserving datasets and software products as research products for reproducibility.
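The abstract notes that well-known hosting platforms can be identified with regular expressions. A minimal sketch of that idea for the four GHPs it names follows. The hostnames, the pattern, and the function name `ghp_platform` are assumptions made for illustration and are not taken from the paper; the hybrid classifier itself is not reproduced here.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed hostnames for the four GHPs named in the abstract
# (GitHub, GitLab, SourceForge, Bitbucket); not taken from the paper.
GHP_HOST_PATTERN = re.compile(
    r"(?:^|\.)(github\.com|gitlab\.com|sourceforge\.net|bitbucket\.org)$",
    re.IGNORECASE,
)


def ghp_platform(uri: str) -> Optional[str]:
    """Return the matched GHP hostname, or None if the URI is not on a listed GHP."""
    host = urlparse(uri).hostname  # None when the string has no scheme/netloc
    if host is None:
        return None
    match = GHP_HOST_PATTERN.search(host)
    return match.group(1).lower() if match else None
```

A hostname list like this only covers platforms known in advance, which is the limitation the abstract raises: URIs on smaller hosting platforms return None here and would need a different approach to be recognized.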
Pages: 195 - 206
Page count: 12
Related papers
50 in total
  • [1] Identifying experts in software libraries and frameworks among GitHub Users
    Montandon, Joao Eduardo
    Lourdes Silva, Luciana
    Valente, Marco Tulio
    IEEE International Working Conference on Mining Software Repositories, 2019, 2019-May : 276 - 287
  • [2] Mining Software Engineering Data from GitHub
    Gousios, Georgios
    Spinellis, Diomidis
    PROCEEDINGS OF THE 2017 IEEE/ACM 39TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C 2017), 2017: 501 - 502
  • [3] The Measurement of the Software Ecosystem's Productivity with GitHub
    Liao, Zhifang
    Zhao, Yiqi
    Liu, ShengZong
    Zhang, Yan
    Liu, Limin
    Long, Jun
    COMPUTER SYSTEMS SCIENCE AND ENGINEERING, 2021, 36 (01): 239 - 258
  • [4] What’s in a GitHub repository? - A software documentation perspective
    Venigalla, Akhila Sri Manasa
    Sridhar, Chimalakonda
    arXiv, 2021
  • [5] Funding sources in top Software Engineering conference publications
    Kapitsaki, Georgia M.
    Papoutsoglou, Maria
    PROCEEDINGS OF THE 2023 30TH ASIA-PACIFIC SOFTWARE ENGINEERING CONFERENCE, APSEC 2023, 2023: 649 - 650
  • [6] RepoSkillMiner: Identifying software expertise from GitHub repositories using Natural Language Processing
    Kourtzanidis, Stratos
    Chatzigeorgiou, Alexander
    Ampatzoglou, Apostolos
    2020 35TH IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE 2020), 2020: 1353 - 1357
  • [7] Identifying data sources for data warehouses
    Koncilia, C
    Pozewaunig, H
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2002, 2002, 2412 : 213 - 218
  • [8] Analysis of Intercrossed Open-Source Software Repositories Data in GitHub
    Farah, Gabriel
    Correal, Dario
    2013 8TH COMPUTING COLOMBIAN CONFERENCE (8CCC), 2013: 37 - 42
  • [9] It's not just CMM software
    Woodbine, Ken
    Quality, 2007, 46 (01): 52 - 59
  • [10] Data without software are just numbers
    Davenport J.H.
    Grant J.
    Jones C.M.
    Data Science Journal, 2020, 19 (01)