A mixture model for the analysis of data derived from record linkage

Cited by: 10
Authors
Hof, M. H. P. [1]
Zwinderman, A. H. [1]
Affiliations
[1] Univ Amsterdam, Acad Med Ctr, Dept Clin Epidemiol Biostat & Bioinformat, NL-1105 AZ Amsterdam, Netherlands
Keywords
probabilistic record linkage; EM algorithm; large data sources; combining multiple registries; partially identifying variables; LINKED DATA; DENSITY-ESTIMATION; REGRESSION;
DOI
10.1002/sim.6315
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Combining information from two data sources depends on finding records that belong to the same individual (matches). Sometimes, unique identifiers per individual are not available, and we have to rely on partially identifying variables that are registered in both data sources. A risk of relying on these variables is that some records from both datasets are wrongly linked to each other, which introduces bias in further regression analyses. In this paper, we propose a mixture model where we treat the indicator whether records belong to the same individual as missing. Each pair of records from both datasets contributes independently to a pairwise pseudo-likelihood, which we maximize with an expectation-maximization algorithm. Each part of the pseudo-likelihood is parameterized by the appropriate (parametric) density function. Moreover, some structures of the data allow for simplifying assumptions, which makes the pseudo-likelihood considerably easier to parameterize. Because the optimization requires a product over all combinations of records from both datasets, we suggest a procedure that summarizes information from highly unlikely matches. With simulations, we showed that the new approach produces accurate estimates in different linkage scenarios. Moreover, the estimator remained accurate in scenarios where previously proposed analysis approaches give biased results. We applied the method to estimation of the association between pregnancy duration of the first and second born children from the same mother from a register without mother identifier. Copyright (c) 2014 John Wiley & Sons, Ltd.
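The idea in the abstract of treating the match indicator as missing and maximizing a pairwise pseudo-likelihood with EM can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single partially identifying variable reduced to an agree/disagree flag per record pair, a normal linear regression of `y` on `x` for matched pairs, and a normal distribution for `y` that ignores `x` for non-matched pairs. The function names, parameter names (`p`, `m`, `u`, `a`, `b`, `s`), and the simulated data are all hypothetical, and the sketch omits the paper's procedure for summarizing highly unlikely matches.

```python
import numpy as np


def normal_pdf(v, mean, sd):
    """Density of N(mean, sd^2), evaluated elementwise at v."""
    return np.exp(-0.5 * ((v - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))


def em_pairwise_mixture(x, y, agree, n_iter=300):
    """EM for a two-component mixture over all record pairs (illustrative only).

    x: covariate from file A, shape (n_a,); y: outcome from file B, shape (n_b,).
    agree: boolean (n_a, n_b) array, True where the partial identifier agrees.
    """
    n_a, n_b = agree.shape
    n_pairs = n_a * n_b
    X = np.broadcast_to(x[:, None], (n_a, n_b))  # x_i for every pair (i, j)
    Y = np.broadcast_to(y[None, :], (n_a, n_b))  # y_j for every pair (i, j)

    # Starting values (assumed, not taken from the paper).
    p = 1.0 / max(n_a, n_b)            # prior probability that a pair is a match
    m, u = 0.9, agree.mean()           # P(agree | match), P(agree | non-match)
    a, b, s = y.mean(), 0.0, y.std()   # regression of y on x among matches
    mu0, s0 = y.mean(), y.std()        # distribution of y among non-matches

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        f_match = p * np.where(agree, m, 1.0 - m) * normal_pdf(Y, a + b * X, s)
        f_non = (1.0 - p) * np.where(agree, u, 1.0 - u) * normal_pdf(Y, mu0, s0)
        w = f_match / (f_match + f_non)

        # M-step: weighted maximum-likelihood updates.
        w_sum = w.sum()
        v_sum = n_pairs - w_sum
        p = w_sum / n_pairs
        m = (w * agree).sum() / w_sum
        u = ((1.0 - w) * agree).sum() / v_sum
        x_bar = (w * X).sum() / w_sum
        y_bar = (w * Y).sum() / w_sum
        b = (w * (X - x_bar) * (Y - y_bar)).sum() / (w * (X - x_bar) ** 2).sum()
        a = y_bar - b * x_bar
        s = np.sqrt((w * (Y - a - b * X) ** 2).sum() / w_sum)
        mu0 = ((1.0 - w) * Y).sum() / v_sum
        s0 = np.sqrt(((1.0 - w) * (Y - mu0) ** 2).sum() / v_sum)

    return {"p": p, "m": m, "u": u, "a": a, "b": b, "s": s}


def simulate(n=150, n_categories=100, error_rate=0.05, seed=0):
    """Simulate two files on the same n individuals with a noisy shared key."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # true slope is 2
    key_a = rng.integers(0, n_categories, size=n)
    key_b = key_a.copy()
    wrong = rng.random(n) < error_rate                  # registration errors in file B
    key_b[wrong] = rng.integers(0, n_categories, size=wrong.sum())
    order = rng.permutation(n)                          # file B arrives in another order
    y, key_b = y[order], key_b[order]
    agree = key_a[:, None] == key_b[None, :]
    return x, y, agree


if __name__ == "__main__":
    x, y, agree = simulate()
    print(em_pairwise_mixture(x, y, agree))
```

Every pair of records contributes to the E-step, so memory and run time grow with the product of the two file sizes. This is the cost that motivates the summarizing procedure mentioned in the abstract, which this sketch does not attempt.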
Pages: 74-92
Number of pages: 19