unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata

Cited by: 0
Authors
Tarek Saier
Michael Färber
Affiliations
[1] Karlsruhe Institute of Technology (KIT),Institute AIFB
Source
Scientometrics | 2020 / Vol. 125
Keywords
Scholarly data; Citations; arXiv.org; Digital libraries; Data set
DOI
Not available
Abstract
In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks, and their applicability in real-world scenarios, depends heavily on the data set used. However, existing scholarly data sets are limited in several regards. In this paper, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. In addition to the papers’ plain text, the data set provides in-text citations annotated via global identifiers. Furthermore, citing and cited publications are linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, can not only enhance the future evaluation of research paper-based and citation context-based approaches, but also serve as a basis for new ways to analyze in-text citations, as we show prototypically in this article.
Pages: 3085 - 3108
Page count: 23
Related papers
8 items in total
  • [1] unarXive: a large scholarly data set with publications' full-text, annotated in-text citations, and links to metadata
    Saier, Tarek
    Faerber, Michael
    [J]. SCIENTOMETRICS, 2020, 125 (03) : 3085 - 3108
  • [2] Data set Mentions and Citations: A Content Analysis of Full-Text Publications
    Zhao, Mengnan
    Yan, Erjia
    Li, Kai
    [J]. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2018, 69 (01) : 32 - 46
  • [3] Deep Learning-based Extraction of Algorithmic Metadata in Full-Text Scholarly Documents
    Safder, Iqra
    Hassan, Saeed-Ul
    Visvizi, Anna
    Noraset, Thanapon
    Nawaz, Raheel
    Tuarob, Suppawong
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (06)
  • [4] Deep context of citations using machine-learning models in scholarly full-text articles
    Saeed-Ul Hassan
    Mubashir Imran
    Sehrish Iqbal
    Naif Radi Aljohani
    Raheel Nawaz
    [J]. Scientometrics, 2018, 117 : 1645 - 1662
  • [5] Deep context of citations using machine-learning models in scholarly full-text articles
    Hassan, Saeed-Ul
    Imran, Mubashir
    Iqbal, Sehrish
    Aljohani, Naif Radi
    Nawaz, Raheel
    [J]. SCIENTOMETRICS, 2018, 117 (03) : 1645 - 1662
  • [6] unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including Structured Full-Text and Citation Network
    Saier, Tarek
    Krause, Johan
    Faerber, Michael
    [J]. 2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 66 - 70
  • [7] DS4A: Deep Search System for Algorithms from Full-text Scholarly Big Data
    Safder, Iqra
    Saeed-Ul Hassan
    [J]. 2018 18TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS (ICDMW), 2018, : 1308 - 1315
  • [8] Linking full-text grey literature to underlying research and post-publication data: An Enhanced Publications Project 2011-2012
    Farace, Dominic J.
    Frantzen, Jerry
    Stock, Christiane
    Sesink, Laurents
    Rabina, Debbie
    [J]. GREY CIRCUIT: FROM SOCIAL NETWORKING TO WEALTH CREATION, 2012, 13 : 143 - +