Ethnicity-based name partitioning for author name disambiguation using supervised machine learning

Cited by: 10
Authors
Kim, Jinseok [1]
Kim, Jenna [2]
Owen-Smith, Jason [3]
Affiliations
[1] Univ Michigan, Inst Social Res, Survey Res Ctr, Inst Res Innovat & Sci, 330 Packard St, Ann Arbor, MI 48104 USA
[2] Univ Illinois, Sch Informat Sci, Champaign, IL USA
[3] Univ Michigan, Inst Social Res, Dept Sociol, Ann Arbor, MI USA
Funding
U.S. National Science Foundation;
Keywords
ACCURACY;
DOI
10.1002/asi.24459
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
In several author name disambiguation studies, some ethnic name groups, such as East Asian names, are reported to be more difficult to disambiguate than others. This implies that disambiguation approaches might be improved if ethnic name groups are distinguished before disambiguation. We explore the potential of ethnic name partitioning by comparing the performance of four machine learning algorithms trained and tested either on the entire data or on individual name groups. Results show that ethnicity-based name partitioning can substantially improve disambiguation performance because the individual models are better suited to their respective name groups. The improvements occur across all ethnic name groups, with different magnitudes. Performance gains in predicting matched name pairs outweigh losses in predicting nonmatched pairs. Feature (e.g., coauthor name) similarities of name pairs vary across ethnic name groups. Such differences may enable the development of ethnicity-specific feature weights to improve prediction for specific ethnic name categories. These findings are observed for three labeled datasets with a natural distribution of problem sizes as well as one in which all ethnic name groups are controlled to have the same sizes of ambiguous names. This study is expected to motivate scholars to group author names based on ethnicity prior to disambiguation.
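The partitioning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper compares four machine learning algorithms on labeled data with separate training and testing, whereas this toy swaps in a single similarity threshold fitted and scored on the same made-up pairs. The group labels, similarity values, and function names are all hypothetical.

```python
from collections import defaultdict

# Toy name pairs: (ethnic_group, coauthor_similarity, is_match).
# All values are invented for illustration; they are not from the paper.
PAIRS = [
    ("group_a", 0.9, True), ("group_a", 0.8, True),
    ("group_a", 0.6, False), ("group_a", 0.5, False),
    ("group_b", 0.5, True), ("group_b", 0.4, True),
    ("group_b", 0.2, False), ("group_b", 0.1, False),
]


def accuracy(pairs, threshold):
    """Share of pairs classified correctly by 'match if similarity >= threshold'."""
    hits = sum((sim >= threshold) == label for _, sim, label in pairs)
    return hits / len(pairs)


def fit_threshold(pairs):
    """Pick the observed similarity value that gives the best training accuracy."""
    candidates = sorted({sim for _, sim, _ in pairs})
    return max(candidates, key=lambda t: accuracy(pairs, t))


def fit_partitioned(pairs):
    """Fit one threshold per ethnic name group instead of one for all pairs."""
    by_group = defaultdict(list)
    for record in pairs:
        by_group[record[0]].append(record)
    return {group: fit_threshold(records) for group, records in by_group.items()}


def partitioned_accuracy(pairs, thresholds):
    """Accuracy when each pair is scored with its own group's threshold."""
    hits = sum((sim >= thresholds[group]) == label for group, sim, label in pairs)
    return hits / len(pairs)


# One model for the entire data versus one model per name group.
global_threshold = fit_threshold(PAIRS)
group_thresholds = fit_partitioned(PAIRS)
global_acc = accuracy(PAIRS, global_threshold)
part_acc = partitioned_accuracy(PAIRS, group_thresholds)

print("global:", global_threshold, global_acc)
print("partitioned:", group_thresholds, part_acc)
```

In this toy data the two groups separate matches from nonmatches at different similarity levels, so a single global threshold misclassifies some pairs that group-specific thresholds handle. This mirrors, in a simplified way, the abstract's observation that feature similarities vary across ethnic name groups.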
Pages: 979 - 994
Number of pages: 16
Related papers
50 in total
  • [1] Two supervised learning approaches for name disambiguation in author citations
    Han, H
    Giles, L
    Zha, H
    Li, C
    Tsioutsiouliklis, K
    [J]. JCDL 2004: PROCEEDINGS OF THE FOURTH ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES: GLOBAL REACH AND DIVERSE IMPACT, 2004 : 296 - 305
  • [2] Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network
    Sheng Xiaoguang
    Wang Ying
    Qian Li
    [J]. JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2021, 43 (12) : 3442 - 3450
  • [3] Model Reuse in Machine Learning for Author Name Disambiguation: An Exploration of Transfer Learning
    Kim, Jinseok
    Owen-Smith, Jason
    [J]. IEEE ACCESS, 2020, 8 (08) : 188378 - 188389
  • [4] Author Name Disambiguation
    Smalheiser, Neil R.
    Torvik, Vetle I.
    [J]. ANNUAL REVIEW OF INFORMATION SCIENCE AND TECHNOLOGY, 2009, 43 : 287 - 313
  • [5] The impact of imbalanced training data on machine learning for author name disambiguation
    Jinseok Kim
    Jenna Kim
    [J]. Scientometrics, 2018, 117 : 511 - 526
  • [6] The impact of imbalanced training data on machine learning for author name disambiguation
    Kim, Jinseok
    Kim, Jenna
    [J]. SCIENTOMETRICS, 2018, 117 (01) : 511 - 526
  • [7] ANDez: An open-source tool for author name disambiguation using machine learning
    Kim, Jinseok
    Kim, Jenna
    [J]. SOFTWAREX, 2024, 26
  • [8] Author Name Disambiguation Using Predictive Models
    Talaba, George
    Fotache, Mann
    [J]. EDUCATION EXCELLENCE AND INNOVATION MANAGEMENT THROUGH VISION 2020, 2019 : 4703 - 4710
  • [9] Using Web Information for Author Name Disambiguation
    Pereira, Denilson Alves
    Ribeiro-Neto, Berthier
    Ziviani, Nivio
    Laender, Alberto H. F.
    Goncalves, Marcos Andre
    Ferreira, Anderson A.
    [J]. JCDL 09: PROCEEDINGS OF THE 2009 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, 2009 : 49 - 58
  • [10] Author Name Disambiguation Based on Heterogeneous Graph
    Ma, Chuang
    Xia, Helong
    [J]. Journal of Computers (Taiwan), 2023, 34 (04) : 41 - 52