Convergence Diagnostics for Entity Resolution

Authors
Aleshin-Guendel, Serge [1]
Steorts, Rebecca C. [1, 2, 3, 4]
Affiliations
[1] Duke Univ, Dept Stat Sci, Durham, NC 27706 USA
[2] Duke Univ, Rhodes Informat Initiat Duke iiD, Dept Biostat & Bioinformat, Dept Comp Sci, Durham, NC USA
[3] Duke Univ, Social Sci Res Inst SSRI, Durham, NC USA
[4] US Census Bur, Ctr Stat Res & Methodol, Suitland, MD USA
Keywords
convergence diagnostics; duplicate detection; entity resolution; Markov chain Monte Carlo; partition; record linkage; CHAIN-MONTE-CARLO; RECORD-LINKAGE; POPULATION-SIZE; BAYESIAN-APPROACH; MINORIZATION;
DOI
10.1146/annurev-statistics-040522-114848
Chinese Library Classification number
O1 [Mathematics]
Subject classification code
0701; 070101
Abstract
Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.
Pages: 419-435
Page count: 17