The impact of imbalanced training data on machine learning for author name disambiguation

Cited by: 0

Authors:

Jinseok Kim

Jenna Kim

Affiliations:

[1] University of Michigan, Institute for Research on Innovation and Science, Survey Research Center, Institute for Social Research

[2] Syracuse University, School of Information Studies

Source:

Scientometrics | 2018 / Vol. 117

Keywords:

Author name disambiguation; Negative training data; Imbalanced training data; Supervised machine learning

DOI:

Not available

Abstract:

In supervised machine learning for author name disambiguation, negative training data often far outnumber positive training data. This paper examines how the ratio of negative to positive training data affects the performance of machine learning algorithms that disambiguate author names in bibliographic records. On multiple labeled datasets, three classifiers (Logistic Regression, Naïve Bayes, and Random Forest) are trained on representative features, such as coauthor names and title words, extracted from the same training data but with various positive-to-negative training data ratios. Results show that increasing negative training data can improve disambiguation performance, but only by a few percent, and can sometimes degrade it. Logistic Regression and Naïve Bayes learn optimal disambiguation models even with a base ratio (1:1) of positive to negative training data. The performance improvement of Random Forest tends to saturate quickly beyond roughly 1:10 to 1:15. These findings imply that, contrary to the common practice of using all training data, name disambiguation algorithms can be trained on only part of the negative training data without much loss of disambiguation performance while increasing computational efficiency. This study calls for more attention from author name disambiguation scholars to methods for machine learning from imbalanced data.
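
The idea of training on only part of the negative data can be sketched as a simple random undersampling step. This is an illustration under stated assumptions, not the authors' implementation: the function name `subsample_negatives`, the pair representation, and the toy data are all hypothetical, and labels are assumed to be 1 for a matching record pair (positive) and 0 for a non-matching pair (negative).

```python
import random


def subsample_negatives(pairs, labels, neg_per_pos, seed=0):
    """Keep every positive pair and at most `neg_per_pos` negatives per positive.

    Hypothetical helper: `pairs` is a list of record pairs, `labels` holds
    1 (same author) or 0 (different authors) for each pair.
    """
    if neg_per_pos < 0:
        raise ValueError("neg_per_pos must be non-negative")
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    # Never ask for more negatives than are available.
    n_keep = min(len(negatives), int(neg_per_pos * len(positives)))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = sorted(positives + rng.sample(negatives, n_keep))
    return [pairs[i] for i in kept], [labels[i] for i in kept]


# Toy data (made up): 5 positive pairs and 100 negative pairs.
pairs = [("rec%d" % i, "rec%d" % (i + 1)) for i in range(105)]
labels = [1] * 5 + [0] * 100

for k in (1, 5, 10, 15):
    _, y = subsample_negatives(pairs, labels, k)
    print(f"1:{k} -> {sum(y)} positives, {len(y) - sum(y)} negatives")
```

A classifier would then be trained once per ratio on the subsampled pairs and evaluated on the same held-out set, so that any performance difference can be attributed to the ratio rather than to the test data.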

Pages: 511-526

Number of pages: 15
