Efficient Topic-based Unsupervised Name Disambiguation

Cited by: 77
Authors
Song, Yang [1 ]
Huang, Jian [2 ]
Councill, Isaac G. [2 ]
Li, Jia [1 ,3 ]
Giles, C. Lee [1 ,2 ]
Affiliations
[1] Penn State Univ, Dept Comp Sci & Engn, University Pk, PA 16802 USA
[2] Penn State Univ, Informat Sci & Technol, University Pk, PA 16802 USA
[3] Penn State Univ, Dept Stat, University Pk, PA 16802 USA
Keywords
Unsupervised Machine Learning; Bayesian Models; Name;
DOI
10.1145/1255175.1255243
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
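The clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic distributions below are hand-written stand-ins for the output of a topic model, and the cosine distance, average linkage, and stopping threshold are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from math import sqrt


def cosine_distance(p, q):
    """Cosine distance between two topic-distribution vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def average_linkage(c1, c2, feats):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(feats[i], feats[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(feats, threshold):
    """Bottom-up clustering: repeatedly merge the two closest clusters.

    Stops when the closest pair is farther apart than `threshold`
    (an assumed stopping rule, not taken from the paper).
    """
    clusters = [[i] for i in range(len(feats))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], feats)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical topic distributions for five documents that all mention
# the same ambiguous name; each row sums to 1 over three topics.
topic_dists = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]

clusters = agglomerate(topic_dists, threshold=0.3)
print(clusters)
```

With these made-up inputs, documents 0 and 1 fall into one cluster, documents 2 and 3 into another, and document 4 stays alone, so the sketch would treat the name as referring to three distinct people. In practice the feature vectors would come from a fitted topic model, and the linkage and threshold would need tuning.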
Pages: 342 / +
Page count: 3