The impact of imbalanced training data on machine learning for author name disambiguation

Cited by: 31
Authors
Kim, Jinseok [1]
Kim, Jenna [2]
Affiliations
[1] Univ Michigan, Survey Res Ctr, Inst Res Innovat & Sci, Inst Social Res, 330 Packard St, Ann Arbor, MI 48104 USA
[2] Syracuse Univ, Sch Informat Studies, 343 Hinds Hall, Syracuse, NY 13210 USA
Funding
National Science Foundation (USA);
Keywords
Author name disambiguation; Negative training data; Imbalanced training data; Supervised machine learning; COVARIATE SHIFT;
DOI
10.1007/s11192-018-2865-9
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
In supervised machine learning for author name disambiguation, negative training data often greatly outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naive Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and sometimes degrades it. Logistic Regression and Naive Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. Also, the performance improvement by Random Forest tends to saturate quickly, roughly after a ratio of 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on part of the negative training data without much loss in disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
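The idea of training on only part of the negative data at a fixed positive-to-negative ratio can be illustrated with a small sketch. The abstract does not describe the paper's actual sampling procedure, so the function name `undersample_negatives`, the parameter `neg_per_pos`, the uniform random sampling, and the toy labels below are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


def undersample_negatives(
    labels: Sequence[int],
    neg_per_pos: int,
    seed: int = 0,
) -> List[int]:
    """Return sorted indices of a training subset with a 1:neg_per_pos ratio.

    All positive pairs (label 1) are kept. Negative pairs (label 0) are
    sampled uniformly without replacement; if fewer negatives exist than
    requested, all of them are kept. (Hypothetical helper, not from the paper.)
    """
    if neg_per_pos < 1:
        raise ValueError("neg_per_pos must be >= 1")
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng = random.Random(seed)
    n_neg = min(len(neg), neg_per_pos * len(pos))
    return sorted(pos + rng.sample(neg, n_neg))


# Toy data (made up): 5 positive record pairs and 200 negative record pairs.
labels = [1] * 5 + [0] * 200
for k in (1, 10, 15):
    subset = undersample_negatives(labels, neg_per_pos=k)
    n_pos = sum(labels[i] for i in subset)
    print(f"1:{k} -> {n_pos} positives, {len(subset) - n_pos} negatives")
```

The returned indices could then be used to select feature rows for whichever classifier is being trained; repeating the loop over several ratios gives training sets of increasing size drawn from the same labeled data.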
Pages: 511 - 526
Page count: 16
Related papers
50 in total
  • [41] Generating automatically labeled data for author name disambiguation: an iterative clustering method
    Kim, Jinseok
    Kim, Jinmo
    Owen-Smith, Jason
    [J]. SCIENTOMETRICS, 2019, 118 (01) : 253 - 280
  • [42] Machine learning for mining imbalanced data
    Arafat, Md. Yasir
    Hoque, Sabera
    Xu, Shuxiang
    Farid, Dewan Md
    [J]. IAENG International Journal of Computer Science, 2019, 46 (02) : 332 - 348
  • [43] Off-the-shelf Semantic Author Name Disambiguation for Bibliographic Data Bases
    Mueller, Mark-Christoph
    Bannister, Adam
    Reitz, Florian
    [J]. DIGITAL LIBRARIES FOR OPEN KNOWLEDGE, TPDL 2019, 2019, 11799 : 397 - 400
  • [44] Generating automatically labeled data for author name disambiguation: an iterative clustering method
    Jinseok Kim
    Jinmo Kim
    Jason Owen-Smith
    [J]. Scientometrics, 2019, 118 : 253 - 280
  • [45] ORCID-linked labeled data for evaluating author name disambiguation at scale
    Kim, Jinseok
    Owen-Smith, Jason
    [J]. SCIENTOMETRICS, 2021, 126 (03) : 2057 - 2083
  • [46] Imbalanced generative sampling of training data for improving quality of machine learning model
    Coskun, Umut Can
    Dogan, Kemal Mert
    Gunpinar, Erkan
    [J]. ADVANCED ENGINEERING INFORMATICS, 2024, 62
  • [47] Author Name Disambiguation Based on Semi-supervised Learning with Graph Convolutional Network
    Sheng Xiaoguang
    Wang Ying
    Qian Li
    [J]. JOURNAL OF ELECTRONICS & INFORMATION TECHNOLOGY, 2021, 43 (12) : 3442 - 3450
  • [48] Framework for Author Name Disambiguation in Scientific Papers Using an Ontological Approach and Deep Learning
    Diaz-de-la-Paz, Lisandra
    Concepcion-Perez, Leonardo
    Armando Portal-Diaz, Jorge
    Taboada-Crispi, Alberto
    Abel Leiva-Mederos, Amed
    [J]. KNOWLEDGE GRAPHS AND SEMANTIC WEB, KGSWC 2022, 2022, 1686 : 216 - 233
  • [49] Author Name Disambiguation by Using Deep Neural Network
    Hung Nghiep Tran
    Tin Huynh
    Tien Do
    [J]. INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT 1, 2014, 8397 : 123 - 132
  • [50] Dynamic author name disambiguation for growing digital libraries
    Qian, Yanan
    Zheng, Qinghua
    Sakai, Tetsuya
    Ye, Junting
    Liu, Jun
    [J]. INFORMATION RETRIEVAL JOURNAL, 2015, 18 (05) : 379 - 412