Measuring re-identification risk using a synthetic estimator to enable data sharing

Cited by: 6
Authors
Jiang, Yangdi [1, 2]
Mosquera, Lucy [2]
Jiang, Bei [1]
Kong, Linglong [1]
El Emam, Khaled [2, 3, 4]
Affiliations
[1] Univ Alberta, Dept Math & Stat Sci, Edmonton, AB, Canada
[2] Repl Analyt Ltd, Ottawa, ON, Canada
[3] Univ Ottawa, Sch Epidemiol & Publ Hlth, Ottawa, ON, Canada
[4] Childrens Hosp Eastern Ontario, Res Inst, Ottawa, ON, Canada
Source
PLOS ONE | 2022 / Volume 17 / Issue 06
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
COPULA MODELS;
DOI
10.1371/journal.pone.0269097
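Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Subject classification codes
07; 0710; 09;
Abstract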
Background: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.
Objectives: Develop an accurate risk estimator for the sample-to-population attack.
Methods: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
Results: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.
Conclusions: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
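As a rough illustration of the sample-to-population attack described in the abstract, the sketch below scores a sample against a stand-in population and then averages the scores obtained from two stand-in populations. It is not the paper's estimator: the copula fitting that would generate the synthetic populations is not shown, and the risk formula used here (the mean of 1/F over sample records, where F is the number of population records sharing the same quasi-identifier values) is an assumed, commonly used definition that the abstract does not spell out. The function names and toy records are hypothetical.

```python
from collections import Counter


def sample_to_population_risk(sample, population):
    """Mean probability of a correct match when an adversary picks a sample
    record and matches it to the population on its quasi-identifiers.

    Each record is a hashable tuple of quasi-identifier values.
    Assumed formula (not taken from the abstract): mean over sample
    records of 1 / F, with F the size of the matching population class.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    class_sizes = Counter(population)
    # Assumption: a sample record with no match in the (synthetic)
    # population is treated as population-unique, i.e. F = 1.
    total = sum(1.0 / max(class_sizes[record], 1) for record in sample)
    return total / len(sample)


def averaged_risk(sample, synthetic_pop_a, synthetic_pop_b):
    """Average the risk computed against two synthetic populations,
    mirroring the idea of averaging two copula-based estimates."""
    risk_a = sample_to_population_risk(sample, synthetic_pop_a)
    risk_b = sample_to_population_risk(sample, synthetic_pop_b)
    return 0.5 * (risk_a + risk_b)


# Toy data: records are (sex, age) tuples; the values are made up.
sample = [("F", 30), ("M", 40)]
pop_a = [("F", 30), ("F", 30), ("M", 40), ("M", 41)]
pop_b = [("F", 30)] * 4 + [("M", 40)] * 2

print(sample_to_population_risk(sample, pop_a))  # (1/2 + 1/1) / 2 = 0.75
print(averaged_risk(sample, pop_a, pop_b))       # (0.75 + 0.375) / 2 = 0.5625
```

In practice the two stand-in populations would come from the fitted Gaussian and d-vine copula models, and the result would be compared against whatever risk threshold the data custodian has adopted.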
Pages: 19