Softcite dataset: A dataset of software mentions in biomedical and economic research publications

Cited by: 17
Authors
Du, Caifan [1]
Cohoon, Johanna [1]
Lopez, Patrice [2]
Howison, James [1]
Affiliations
[1] Univ Texas Austin, 1616 Guadalupe St, Austin, TX 78701 USA
[2] SCI MINER, Naves, France
Keywords
IMPACT; PROVENANCE; AGREEMENT; SCIENCE
DOI
10.1002/asi.24454
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.
Pages: 870-884
Page count: 15