A supervised machine learning approach to author disambiguation in the Web of Science

Cited by: 12
Author
Rehs, Andreas [1]
Affiliation
[1] Deutsche Bundesbank, Frankfurt, Germany
Keywords
Author name disambiguation; Machine learning; Pairwise classification; Random forest; Community detection; Web of Science; NAME DISAMBIGUATION; INFORMATION; IMPACT; MODEL
DOI
10.1016/j.joi.2021.101166
Chinese Library Classification
TP39 [Computer applications]
Subject classification code
081203; 0835
Abstract
Author-level scientometric indicators are an important tool in individual and institution-based research assessment, and they require high-quality author-publication profiles. To address this need, our study developed a robust supervised machine learning approach, combined with graph community detection methods, to disambiguate author names in the Web of Science publication database. We used the unique author identifier Researcher ID to retrieve true authorship data for 1,904 scientists and trained a random forest and a logistic regression classifier on 1.2 million corresponding publication pairs whose authors share the same last name and first-name initial. To do this, we reviewed a vast set of paper and author characteristics and randomly introduced missing data to make our classifiers robust to quality changes in new publication data. On an unseen test set, we achieved F1 scores of 0.82 with the random forest and 0.75 with the logistic regression model. Subsequently, we evaluated feature performance and applied the Infomap graph community detection algorithm to identify all publications belonging to an author. The community detection yields reasonable cluster metrics (mean K-metric of 0.78 for the logistic regression-based model and 0.81 for the random forest-based model). Finally, we tested our algorithm on a large surname-initial block ("Muller, M.") and demonstrated its speed and predictive performance.
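The abstract reports two kinds of scores: F1 for the pairwise classifiers and a K-metric for the resulting clusters. It does not spell out either formula, so the following is only a minimal sketch under stated assumptions: the K-metric is taken to be the definition common in the author disambiguation literature, the geometric mean of average cluster purity (ACP) and average author purity (AAP), and the F1 is computed over publication pairs. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def k_metric(true_labels, pred_labels):
    """K = sqrt(ACP * AAP) for two parallel lists of labels.

    Assumed definition (not quoted from the paper):
    ACP = (1/N) * sum over clusters i, authors j of n_ij^2 / n_i
    AAP = (1/N) * sum over authors j, clusters i of n_ij^2 / n_j
    """
    n = len(true_labels)
    joint = Counter(zip(pred_labels, true_labels))  # n_ij
    cluster_sizes = Counter(pred_labels)            # n_i
    author_sizes = Counter(true_labels)             # n_j
    acp = sum(c * c / cluster_sizes[p] for (p, _), c in joint.items()) / n
    aap = sum(c * c / author_sizes[t] for (_, t), c in joint.items()) / n
    return sqrt(acp * aap)


def pairwise_f1(true_labels, pred_labels):
    """F1 over all publication pairs: positive = 'same author'."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy block: five publications, two true authors,
# and a predicted clustering that misplaces the third publication.
true_authors = ["A", "A", "A", "B", "B"]
pred_clusters = [0, 0, 1, 1, 1]

print(round(k_metric(true_authors, pred_clusters), 4))
print(pairwise_f1(true_authors, pred_clusters))
```

A perfect clustering gives K = 1 under this definition, and both impure clusters and split authors pull the value down, which is why a single K value can summarize cluster quality per name block.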
Pages: 16