Softcite dataset: A dataset of software mentions in biomedical and economic research publications

Cited by: 17

Authors:

Du, Caifan [1]

Cohoon, Johanna [1]

Lopez, Patrice [2]

Howison, James [1]

Affiliations:

[1] Univ Texas Austin, 1616 Guadalupe St, Austin, TX 78701 USA

[2] SCI MINER, Naves, France

Source:

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY | 2021 / Vol. 72 / Issue 07

Keywords:

IMPACT; PROVENANCE; AGREEMENT; SCIENCE;

DOI:

10.1002/asi.24454

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Software contributions to academic research are relatively invisible, especially to the formalized scholarly reputation system based on bibliometrics. In this article, we introduce a gold-standard dataset of software mentions from the manual annotation of 4,971 academic PDFs in biomedicine and economics. The dataset is intended to be used for automatic extraction of software mentions from PDF format research publications by supervised learning at scale. We provide a description of the dataset and an extended discussion of its creation process, including improved text conversion of academic PDFs. Finally, we reflect on our challenges and lessons learned during the dataset creation, in hope of encouraging more discussion about creating datasets for machine learning use.


Pages: 870-884

Page count: 15

