Disambiguation of author entities in ADS using supervised learning and graph theory methods

Cited by: 9

Authors:

Mihaljevic, Helena [1]

Santamaria, Lucia [2]

Affiliations:

[1] Hsch Tech & Wirtschaft, Wilhelminenhofstr 75A, D-12459 Berlin, Germany

[2] Amazon Dev Ctr, Charlottenstr 4, D-10969 Berlin, Germany

Source:

SCIENTOMETRICS | 2021 / Vol. 126 / Issue 05

Keywords:

Author name disambiguation; Record linkage; Supervised learning; Label Propagation; Information retrieval; Digital libraries; NAME DISAMBIGUATION;

DOI:

10.1007/s11192-021-03951-w

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Disambiguation of authors in digital libraries is essential for many tasks, including efficient bibliographical searches and scientometric analyses to the level of individuals. The question of how to link documents written by the same person has been given much attention by academic publishers and information retrieval researchers alike. Usual approaches rely on publications' metadata such as affiliations, email addresses, co-authors, or scholarly topics. Lack of homogeneity in the structure of bibliographic collections and discipline-specific dissimilarities between them make the creation of general-purpose disambiguators arduous. We present an algorithm to disambiguate authorships in the Astrophysics Data System (ADS) following an established semi-supervised approach of training a classifier on authorship pairs and clustering the resulting graphs. Due to the lack of high-signal features such as email addresses and citations, we engineer additional content- and location-based features via text embeddings and named-entity recognition. We train various nonlinear tree-based classifiers and detect communities from the resulting weighted graphs through label propagation, a fast yet efficient algorithm that requires no tuning. The resulting procedure reaches reasonable complexity and offers possibilities for interpretation. We apply our method to the creation of author entities in a recent ADS snapshot. The algorithm is evaluated on 39 manually-labeled author blocks comprising 9545 authorships from 562 author profiles. Our best approach utilizes the Random Forest classifier and yields a micro- and macro-averaged BCubed F-1 score of 0.95 and 0.87, respectively. We release our code and labeled data publicly to foster the development of further disambiguation procedures for ADS.
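
The abstract reports BCubed F-1 scores for the resulting author clusters. As a rough illustration of that metric, the sketch below computes BCubed precision, recall, and F-1 for a single author block, given the true author of each authorship and the cluster assigned to it. The function name `bcubed` and the toy labels are hypothetical and are not taken from the authors' released code; how the paper aggregates scores into micro- and macro-averages is not shown here.

```python
from collections import Counter


def bcubed(true_labels, pred_labels):
    """Return BCubed (precision, recall, f1) for one block of authorships.

    true_labels[i] is the real author of authorship i,
    pred_labels[i] is the cluster the disambiguator assigned to it.
    Illustrative helper only (assumption, not the paper's code).
    """
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(true_labels)
    true_sizes = Counter(true_labels)               # size of each real author profile
    pred_sizes = Counter(pred_labels)               # size of each predicted cluster
    overlap = Counter(zip(true_labels, pred_labels))  # items sharing both labels

    pairs = list(zip(true_labels, pred_labels))
    # Per-item precision: share of the item's cluster that has the same true author.
    precision = sum(overlap[(t, p)] / pred_sizes[p] for t, p in pairs) / n
    # Per-item recall: share of the item's true author that landed in its cluster.
    recall = sum(overlap[(t, p)] / true_sizes[t] for t, p in pairs) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy block: two real authors ("a", "b"), one authorship of "a" misplaced.
toy_true = ["a", "a", "a", "b", "b"]
toy_pred = [0, 0, 1, 1, 1]
toy_scores = bcubed(toy_true, toy_pred)
```

A perfect clustering scores 1.0 on all three values regardless of how the cluster identifiers are named, since only co-membership matters.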

Pages: 3893 - 3917

Number of pages: 25

Related records: 50 in total

[1] Disambiguation of author entities in ADS using supervised learning and graph theory methods
Helena Mihaljević
Lucía Santamaría
[J]. Scientometrics, 2021, 126 : 3893 - 3917
[2] Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network
Sheng Xiaoguang
Wang Ying
Qian Li
[J]. JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2021, 43 (12) : 3442 - 3450
[3] Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning
Louppe, Gilles
Al-Natsheh, Hussein T.
Susik, Mateusz
Maguire, Eamonn James
[J]. KNOWLEDGE ENGINEERING AND SEMANTIC WEB, KESW 2016, 2016, 649 : 272 - 287
[4] A supervised machine learning approach to author disambiguation in the Web of Science
Rehs, Andreas
[J]. JOURNAL OF INFORMETRICS, 2021, 15 (03)
[5] Two supervised learning approaches for name disambiguation in author citations
Han, H
Giles, L
Zha, H
Li, C
Tsioutsiouliklis, K
[J]. JCDL 2004: PROCEEDINGS OF THE FOURTH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES: GLOBAL REACH AND DIVERSE IMPACT, 2004, : 296 - 305
[6] Graph-based methods for Author Name Disambiguation: a survey
De Bonis, Michele
Falchi, Fabrizio
Manghi, Paolo
[J]. PEERJ COMPUTER SCIENCE, 2023, 9
[7] Ethnicity-based name partitioning for author name disambiguation using supervised machine learning
Kim, Jinseok
Kim, Jenna
Owen-Smith, Jason
[J]. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 2021, 72 (08) : 979 - 994
[8] LUCID: Author Name Disambiguation using Graph Structural Clustering
Hussain, Ijaz
Asghar, Sohail
[J]. PROCEEDINGS OF THE 2017 INTELLIGENT SYSTEMS CONFERENCE (INTELLISYS), 2017, : 406 - 413
[9] Author Name Disambiguation Using Graph Node Embedding Method
Zhang, Wenjing
Yan, Zhongmin
Zheng, Yongqing
[J]. PROCEEDINGS OF THE 2019 IEEE 23RD INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN (CSCWD), 2019, : 410 - 415
[10] Author Name Disambiguation Using Multiple Graph Attention Networks
Zhang, Zhiqiang
Wu, Chunqi
Li, Zhao
Peng, Juanjuan
Wu, Haiyan
Song, Haiyu
Deng, Shengchun
Wang, Biao
[J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021,
