The impact of imbalanced training data on machine learning for author name disambiguation

Authors
Jinseok Kim
Jenna Kim
Affiliations
[1] University of Michigan, Institute for Research on Innovation and Science, Survey Research Center, Institute for Social Research
[2] Syracuse University, School of Information Studies
Source
Scientometrics | 2018 / Vol. 117
Keywords
Author name disambiguation; Negative training data; Imbalanced training data; Supervised machine learning
DOI
Not available
Abstract
In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of positive to negative training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. The performance improvement of Random Forest tends to saturate quickly beyond roughly 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on a subset of the negative training data without much loss of disambiguation performance while gaining computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
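The idea of training on only part of the negative data can be illustrated with a short sketch. It is not the authors' code: the abstract does not say how negatives were selected, so uniform random undersampling is an assumption here, and the function name `subsample_negatives`, the 1/0 label convention (1 = same author, 0 = different authors), and the toy data are all hypothetical.

```python
import random


def subsample_negatives(pairs, labels, neg_per_pos, seed=0):
    """Keep every positive pair and at most `neg_per_pos` negatives per positive.

    Assumption: negatives are drawn uniformly at random without replacement.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    # Never request more negatives than are available.
    n_keep = min(len(neg_idx), neg_per_pos * len(pos_idx))
    kept_neg = rng.sample(neg_idx, n_keep)
    keep = sorted(pos_idx + kept_neg)  # preserve the original ordering
    return [pairs[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 5 positive pairs and 100 negative pairs.
labels = [1] * 5 + [0] * 100
pairs = [f"pair_{i}" for i in range(len(labels))]

for ratio in (1, 5, 10, 15):
    sub_pairs, sub_labels = subsample_negatives(pairs, labels, ratio)
    print(f"1:{ratio} -> {sub_labels.count(1)} positives, "
          f"{sub_labels.count(0)} negatives")
```

Each subsampled set could then be passed to a classifier of choice, with performance compared across ratios on a fixed held-out set.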
Pages: 511-526
Page count: 15