Disambiguation of author entities in ADS using supervised learning and graph theory methods

Cited by: 9
Authors
Mihaljevic, Helena [1]
Santamaria, Lucia [2]
Affiliations
[1] Hsch Tech & Wirtschaft, Wilhelminenhofstr 75A, D-12459 Berlin, Germany
[2] Amazon Dev Ctr, Charlottenstr 4, D-10969 Berlin, Germany
Keywords
Author name disambiguation; Record linkage; Supervised learning; Label propagation; Information retrieval; Digital libraries; Name disambiguation
DOI
10.1007/s11192-021-03951-w
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Disambiguation of authors in digital libraries is essential for many tasks, including efficient bibliographical searches and scientometric analyses to the level of individuals. The question of how to link documents written by the same person has been given much attention by academic publishers and information retrieval researchers alike. Usual approaches rely on publications' metadata such as affiliations, email addresses, co-authors, or scholarly topics. Lack of homogeneity in the structure of bibliographic collections and discipline-specific dissimilarities between them make the creation of general-purpose disambiguators arduous. We present an algorithm to disambiguate authorships in the Astrophysics Data System (ADS) following an established semi-supervised approach of training a classifier on authorship pairs and clustering the resulting graphs. Due to the lack of high-signal features such as email addresses and citations, we engineer additional content- and location-based features via text embeddings and named-entity recognition. We train various nonlinear tree-based classifiers and detect communities from the resulting weighted graphs through label propagation, a fast yet efficient algorithm that requires no tuning. The resulting procedure reaches reasonable complexity and offers possibilities for interpretation. We apply our method to the creation of author entities in a recent ADS snapshot. The algorithm is evaluated on 39 manually-labeled author blocks comprising 9545 authorships from 562 author profiles. Our best approach utilizes the Random Forest classifier and yields a micro- and macro-averaged BCubed F-1 score of 0.95 and 0.87, respectively. We release our code and labeled data publicly to foster the development of further disambiguation procedures for ADS.
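The BCubed F-1 score reported in the abstract can be illustrated with a small sketch. The function below computes BCubed precision, recall, and F-1 for a single block of authorships, using the standard per-item definition: precision is the fraction of an item's predicted cluster that shares its true author, and recall is the fraction of its true author's items found in its predicted cluster. The function name `bcubed`, the toy authorship ids, and the labels are illustrative assumptions, not taken from the authors' released code, and the micro- and macro-averaging across blocks used in the paper is not reproduced here.

```python
from collections import defaultdict


def bcubed(predicted, truth):
    """Return (precision, recall, f1) for one block of authorships.

    predicted, truth: dicts mapping an authorship id to a cluster label.
    Hypothetical helper for illustration only.
    """
    if not truth or set(predicted) != set(truth):
        raise ValueError("both mappings must cover the same non-empty item set")

    # Group items by predicted cluster and by true author.
    pred_clusters = defaultdict(set)
    true_clusters = defaultdict(set)
    for item, label in predicted.items():
        pred_clusters[label].add(item)
    for item, label in truth.items():
        true_clusters[label].add(item)

    precision = 0.0
    recall = 0.0
    for item in truth:
        same_pred = pred_clusters[predicted[item]]
        same_true = true_clusters[truth[item]]
        overlap = len(same_pred & same_true)  # always >= 1 (the item itself)
        precision += overlap / len(same_pred)
        recall += overlap / len(same_true)

    n = len(truth)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy block: five authorships written by two true authors, A and B.
truth = {"a1": "A", "a2": "A", "a3": "A", "a4": "B", "a5": "B"}
# A hypothetical clustering that wrongly moves a3 into B's cluster.
predicted = {"a1": 0, "a2": 0, "a3": 1, "a4": 1, "a5": 1}

p, r, f1 = bcubed(predicted, truth)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because every item is scored against its own cluster, a single misplaced authorship lowers both precision (for the cluster it joined) and recall (for the author it left).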
Pages: 3893-3917
Page count: 25
Related papers
50 in total
  • [1] Disambiguation of author entities in ADS using supervised learning and graph theory methods
    Helena Mihaljević
    Lucía Santamaría
    [J]. Scientometrics, 2021, 126 : 3893 - 3917
  • [2] Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network
    Sheng Xiaoguang
    Wang Ying
    Qian Li
    [J]. JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2021, 43 (12) : 3442 - 3450
  • [3] Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning
    Louppe, Gilles
    Al-Natsheh, Hussein T.
    Susik, Mateusz
    Maguire, Eamonn James
    [J]. KNOWLEDGE ENGINEERING AND SEMANTIC WEB, KESW 2016, 2016, 649 : 272 - 287
  • [4] A supervised machine learning approach to author disambiguation in the Web of Science
    Rehs, Andreas
    [J]. JOURNAL OF INFORMETRICS, 2021, 15 (03)
  • [5] Two supervised learning approaches for name disambiguation in author citations
    Han, H
    Giles, L
    Zha, H
    Li, C
    Tsioutsiouliklis, K
    [J]. JCDL 2004: PROCEEDINGS OF THE FOURTH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES: GLOBAL REACH AND DIVERSE IMPACT, 2004, : 296 - 305
  • [6] Graph-based methods for Author Name Disambiguation: a survey
    De Bonis, Michele
    Falchi, Fabrizio
    Manghi, Paolo
    [J]. PEERJ COMPUTER SCIENCE, 2023, 9
  • [7] Ethnicity-based name partitioning for author name disambiguation using supervised machine learning
    Kim, Jinseok
    Kim, Jenna
    Owen-Smith, Jason
    [J]. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2021, 72 (08) : 979 - 994
  • [8] LUCID: Author Name Disambiguation using Graph Structural Clustering
    Hussain, Ijaz
    Asghar, Sohail
    [J]. PROCEEDINGS OF THE 2017 INTELLIGENT SYSTEMS CONFERENCE (INTELLISYS), 2017, : 406 - 413
  • [9] Author Name Disambiguation Using Graph Node Embedding Method
    Zhang, Wenjing
    Yan, Zhongmin
    Zheng, Yongqing
    [J]. PROCEEDINGS OF THE 2019 IEEE 23RD INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2019, : 410 - 415
  • [10] Author Name Disambiguation Using Multiple Graph Attention Networks
    Zhang, Zhiqiang
    Wu, Chunqi
    Li, Zhao
    Peng, Juanjuan
    Wu, Haiyan
    Song, Haiyu
    Deng, Shengchun
    Wang, Biao
    [J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,