The impact of imbalanced training data on machine learning for author name disambiguation

Authors
Jinseok Kim
Jenna Kim
Affiliations
[1] University of Michigan, Institute for Research on Innovation and Science, Survey Research Center, Institute for Social Research
[2] Syracuse University, School of Information Studies
Source
Scientometrics | 2018 / Vol. 117
Keywords
Author name disambiguation; Negative training data; Imbalanced training data; Supervised machine learning
DOI
Not available
Abstract
In supervised machine learning for author name disambiguation, negative training instances often far outnumber positive ones. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but sampled at various positive-to-negative ratios. Results show that adding negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even at the base ratio (1:1) of positive to negative training data. The performance improvement of Random Forest tends to saturate quickly, at ratios of roughly 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on only part of the negative training data without much loss in disambiguation performance and with a gain in computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
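The ratio-controlled training sets described in the abstract can be illustrated with a small sketch. It assumes plain random undersampling of negative instances, which the abstract does not confirm as the authors' procedure, and it assumes pair-level labels where 1 means "same author" and 0 means "different authors". The helper name `subsample_negatives` and the toy label counts are hypothetical.

```python
import random


def subsample_negatives(labels, neg_per_pos, seed=0):
    """Return sorted indices that keep every positive instance and at most
    `neg_per_pos` randomly chosen negatives per positive.

    Hypothetical helper: random undersampling is an assumption here,
    not a procedure stated in the abstract.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than exist.
    n_keep = min(len(neg), neg_per_pos * len(pos))
    rng = random.Random(seed)  # fixed seed for reproducible samples
    kept_neg = rng.sample(neg, n_keep)
    return sorted(pos + kept_neg)


if __name__ == "__main__":
    # Toy labels (invented): 20 positive pairs, 400 negative pairs.
    labels = [1] * 20 + [0] * 400
    for k in (1, 5, 10, 15):
        idx = subsample_negatives(labels, k)
        n_pos = sum(labels[i] for i in idx)
        print(f"1:{k} -> {n_pos} positives, {len(idx) - n_pos} negatives")
```

The returned indices can be used to select rows of a feature matrix before fitting any classifier, so the same positives are reused at every ratio and only the number of negatives changes.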
Pages: 511-526
Page count: 15