Softcite dataset: A dataset of software mentions in biomedical and economic research publications

Cited by: 17
Authors
Du, Caifan [1]
Cohoon, Johanna [1]
Lopez, Patrice [2]
Howison, James [1]
Affiliations
[1] Univ Texas Austin, 1616 Guadalupe St, Austin, TX 78701 USA
[2] SCI MINER, Naves, France
Keywords
IMPACT; PROVENANCE; AGREEMENT; SCIENCE
DOI
10.1002/asi.24454
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.
Pages: 870-884
Page count: 15