The impact of imbalanced training data on machine learning for author name disambiguation

Cited by: 33
Authors
Kim, Jinseok [1 ]
Kim, Jenna [2 ]
Affiliations
[1] Univ Michigan, Survey Res Ctr, Inst Res Innovat & Sci, Inst Social Res, 330 Packard St, Ann Arbor, MI 48104 USA
[2] Syracuse Univ, Sch Informat Studies, 343 Hinds Hall, Syracuse, NY 13210 USA
Funding
National Science Foundation (USA)
Keywords
Author name disambiguation; Negative training data; Imbalanced training data; Supervised machine learning; Covariate shift
DOI
10.1007/s11192-018-2865-9
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203 ; 0835 ;
Abstract
In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naive Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naive Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. The performance improvement by Random Forest tends to saturate quickly, roughly after a ratio of 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on part of the negative training data without much loss of disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
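The idea of training on only part of the negative data can be illustrated with a small sketch. It is not the authors' code: the helper name `subsample_negatives`, the `(features, label)` pair layout, and the toy data are all assumptions made for illustration. The sketch keeps every positive pair and randomly keeps at most `ratio` negative pairs per positive pair, giving 1:1, 1:5, 1:10, and 1:15 training sets like those the abstract discusses.

```python
import random


def subsample_negatives(pairs, ratio, seed=0):
    """Keep all positive pairs and at most `ratio` negatives per positive.

    `pairs` is a list of (features, label) tuples with label 1 for a
    matching author pair and 0 for a non-matching pair. This layout is a
    hypothetical one chosen for the sketch.
    """
    positives = [p for p in pairs if p[1] == 1]
    negatives = [p for p in pairs if p[1] == 0]
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n_keep = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, n_keep)


# Toy, made-up data: 5 positive pairs and 200 negative pairs.
toy_pairs = [({"id": f"pos{i}"}, 1) for i in range(5)]
toy_pairs += [({"id": f"neg{i}"}, 0) for i in range(200)]

for ratio in (1, 5, 10, 15):
    subset = subsample_negatives(toy_pairs, ratio)
    n_pos = sum(label for _, label in subset)
    n_neg = len(subset) - n_pos
    print(f"1:{ratio} -> {n_pos} positives, {n_neg} negatives")
```

Each subset could then be passed to any classifier in place of the full training set. Whether a given ratio is sufficient would have to be checked empirically on labeled data, as the study does.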
Pages: 511-526
Page count: 16