The impact of imbalanced training data on machine learning for author name disambiguation

Cited by: 31
Authors
Kim, Jinseok [1]
Kim, Jenna [2]
Affiliations
[1] Univ Michigan, Survey Res Ctr, Inst Res Innovat & Sci, Inst Social Res, 330 Packard St, Ann Arbor, MI 48104 USA
[2] Syracuse Univ, Sch Informat Studies, 343 Hinds Hall, Syracuse, NY 13210 USA
Funding
National Science Foundation (USA)
Keywords
Author name disambiguation; Negative training data; Imbalanced training data; Supervised machine learning; Covariate shift
DOI
10.1007/s11192-018-2865-9
Chinese Library Classification
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naive Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative training data ratios. Results show that increasing the amount of negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naive Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. The performance improvement of Random Forest tends to saturate quickly, at roughly 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on part of the negative training data without much loss of disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
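The ratio-controlled training sets described in the abstract can be illustrated with a minimal sketch. It assumes that negatives are chosen by simple random undersampling, which the abstract does not state, and it is not the authors' code. The function name `subsample_negatives`, the 0/1 label encoding, and the toy data are hypothetical.

```python
import random
from typing import List, Sequence


def subsample_negatives(
    labels: Sequence[int],
    neg_per_pos: int,
    seed: int = 0,
) -> List[int]:
    """Return sorted indices that keep every positive example and at most
    `neg_per_pos` randomly chosen negatives per positive (a 1:neg_per_pos ratio).

    Assumption: labels are 1 (same author) or 0 (different authors).
    """
    if neg_per_pos < 0:
        raise ValueError("neg_per_pos must be non-negative")
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Cap the request at the number of negatives actually available.
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    rng = random.Random(seed)  # local generator keeps the draw reproducible
    kept_neg = rng.sample(neg_idx, n_keep)
    return sorted(pos_idx + kept_neg)


if __name__ == "__main__":
    # Hypothetical toy labels: 5 positive pairs and 200 negative pairs.
    toy_labels = [1] * 5 + [0] * 200
    for ratio in (1, 5, 10, 15):
        idx = subsample_negatives(toy_labels, ratio)
        n_pos = sum(toy_labels[i] for i in idx)
        print(f"1:{ratio} -> {n_pos} positives, {len(idx) - n_pos} negatives")
```

The returned indices can be used to select the matching feature rows before fitting any classifier, so the same feature extraction is reused across ratios and only the number of negatives changes.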
Pages: 511-526
Number of pages: 16