It's Not Just GitHub: Identifying Data and Software Sources Included in Publications

Cited by: 2
Authors
Escamilla, Emily [1]
Salsabil, Lamia [1]
Klein, Martin [2]
Wu, Jian [1]
Weigle, Michele C. [1]
Nelson, Michael L. [1]
Affiliations
[1] Old Dominion Univ, Norfolk, VA USA
[2] Los Alamos Natl Lab, Los Alamos, NM 87544 USA
Keywords
Web Archiving; GitHub; arXiv; Digital Preservation; Memento; Open Source Software
DOI
10.1007/978-3-031-43849-3_17
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Paper publications are no longer the only form of research product. Due to recent initiatives by publication venues and funding institutions, open access datasets and software products are increasingly considered research products and URIs to these products are growing more prevalent in scholarly publications. However, as with all URIs, resources found on the live Web are not permanent. Archivists and institutions including Software Heritage, Internet Archive, and Zenodo are working to preserve data and software products as valuable parts of reproducibility, a cornerstone of scientific research. While some hosting platforms are well-known and can be identified with regular expressions, there are a vast number of smaller, more niche hosting platforms utilized by researchers to host their data and software. If it is not feasible to manually identify all hosting platforms used by researchers, how can we identify URIs to open-access data and software (OADS) to aid in their preservation? We used a hybrid classifier to classify URIs as OADS URIs and non-OADS URIs. We found that URIs to Git hosting platforms (GHPs) including GitHub, GitLab, SourceForge, and Bitbucket accounted for 33% of OADS URIs. Non-GHP OADS URIs are distributed across almost 50,000 unique hostnames. We determined that using a hybrid classifier allows for the identification of OADS URIs in less common hosting platforms which can benefit discoverability for preserving datasets and software products as research products for reproducibility.
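The abstract notes that well-known hosting platforms can be identified with regular expressions. A minimal sketch of that idea follows, covering only the four GHPs named above. The hostname pattern and the function name `is_ghp_uri` are illustrative assumptions, not the authors' implementation, and the hybrid classifier used for less common platforms is not shown.

```python
import re
from urllib.parse import urlparse

# Hostnames for the four GHPs named in the abstract. The exact pattern is an
# assumption made for illustration; it is not taken from the paper.
GHP_HOST_PATTERN = re.compile(
    r"(^|\.)(github\.com|gitlab\.com|sourceforge\.net|bitbucket\.org)$",
    re.IGNORECASE,
)


def is_ghp_uri(uri: str) -> bool:
    """Return True if the URI's hostname is one of the listed GHPs or a subdomain of one."""
    host = urlparse(uri).hostname or ""
    return bool(GHP_HOST_PATTERN.search(host))


if __name__ == "__main__":
    # Hypothetical example URIs, used only to exercise the function.
    for uri in (
        "https://github.com/user/repo",
        "https://zenodo.org/record/1",
    ):
        print(uri, is_ghp_uri(uri))
```

Matching on the parsed hostname rather than on the whole URI string avoids false positives where a platform name appears only in a path or query string. It does not help with the smaller, niche platforms, which is the gap the abstract's hybrid classifier addresses.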
Pages: 195-206
Number of pages: 12