Measuring re-identification risk using a synthetic estimator to enable data sharing

Cited by: 6
Authors
Jiang, Yangdi [1, 2]
Mosquera, Lucy [2]
Jiang, Bei [1]
Kong, Linglong [1]
El Emam, Khaled [2, 3, 4]
Affiliations
[1] Univ Alberta, Dept Math & Stat Sci, Edmonton, AB, Canada
[2] Repl Analyt Ltd, Ottawa, ON, Canada
[3] Univ Ottawa, Sch Epidemiol & Publ Hlth, Ottawa, ON, Canada
[4] Childrens Hosp Eastern Ontario, Res Inst, Ottawa, ON, Canada
Source
PLOS ONE | 2022 / Vol. 17 / Issue 06
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
COPULA MODELS;
DOI
10.1371/journal.pone.0269097
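Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code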
07 ; 0710 ; 09 ;
Abstract
Background: One common way to share health data for secondary analysis while meeting increasingly strict privacy regulations is to de-identify it. To demonstrate that the risk of re-identification is acceptably low, re-identification risk metrics are used. There is a dearth of good risk estimators modeling the attack scenario where an adversary selects a record from the microdata sample and attempts to match it with individuals in the population.
Objectives: Develop an accurate risk estimator for the sample-to-population attack.
Methods: A type of estimator based on creating a synthetic variant of a population dataset was developed to estimate the re-identification risk for an adversary performing a sample-to-population attack. The accuracy of the estimator was evaluated through a simulation on four different datasets in terms of estimation error. Two estimators were considered, a Gaussian copula and a d-vine copula. They were compared against three other estimators proposed in the literature.
Results: Taking the average of the two copula estimates consistently had a median error below 0.05 across all sampling fractions and true risk values. This was significantly more accurate than existing methods. A sensitivity analysis of the estimator accuracy based on variation in input parameter accuracy provides further application guidance. The estimator was then used to assess re-identification risk and de-identify a large Ontario COVID-19 behavioral survey dataset.
Conclusions: The average of two copula estimators consistently provides the most accurate re-identification risk estimate and can serve as a good basis for managing privacy risks when data are de-identified and shared.
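The abstract does not state the risk formula, so the following is only a minimal sketch under stated assumptions, not the authors' estimator. It assumes that a record is a tuple of quasi-identifier values, that an adversary matching a sample record picks uniformly among the F population records sharing those values (match probability 1/F), and that the overall risk is the average of 1/F over the sample. The function names `sample_to_population_risk` and `averaged_estimate` are hypothetical, and the "synthetic populations" are plain lists standing in for populations that the paper would generate from fitted Gaussian and d-vine copulas.

```python
from collections import Counter


def sample_to_population_risk(sample, population):
    """Average match probability for a sample-to-population attack.

    Assumption (not from the source): each sample record is matched at
    random against the F population records with the same
    quasi-identifier values, so its match probability is 1/F.
    """
    pop_counts = Counter(population)
    total = 0.0
    for record in sample:
        # A synthetic population may lack a sample record's combination;
        # treating F as 1 in that case is a conservative assumption.
        f = max(pop_counts.get(record, 0), 1)
        total += 1.0 / f
    return total / len(sample)


def averaged_estimate(sample, synthetic_populations):
    """Mean of the risk estimates from several synthetic populations."""
    estimates = [sample_to_population_risk(sample, p) for p in synthetic_populations]
    return sum(estimates) / len(estimates)


# Toy data: quasi-identifiers are (sex, age) pairs.
sample = [("F", 30), ("M", 30), ("M", 40)]
population_a = [("F", 30)] * 4 + [("M", 30)] * 2 + [("M", 40)]
population_b = [("F", 30)] * 2 + [("M", 30)] * 2 + [("M", 40)] * 2

risk_a = sample_to_population_risk(sample, population_a)
risk_avg = averaged_estimate(sample, [population_a, population_b])
```

Here `population_a` and `population_b` play the role of two synthetic populations, and `averaged_estimate` mirrors the abstract's idea of averaging two estimates; how the copulas are fitted and sampled is outside this sketch.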
Pages: 19
Related papers
50 in total
  • [21] Evaluation of Re-identification Risk using Anonymization and Differential Privacy in Healthcare
    Ratra, Ritu
    Gulia, Preeti
    Gill, Nasib Singh
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (02) : 563 - 570
  • [22] Evaluation of Re-identification Risk using Anonymization and Differential Privacy in Healthcare
    Ratra R.
    Gulia P.
    Gill N.S.
    International Journal of Advanced Computer Science and Applications, 2022, 13 (02) : 563 - 570
  • [23] Re-identification of Anonymized CDR datasets Using Social Network Data
    Cecaj, Alket
    Mamei, Marco
    Bicocchi, Nicola
    2014 IEEE INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING AND COMMUNICATIONS WORKSHOPS (PERCOM WORKSHOPS), 2014, : 237 - 242
  • [24] Risk of re-identification of epigenetic methylation data: a more nuanced response is needed
    Yann Joly
    Stephanie OM Dyke
    Warren A Cheung
    Mark A Rothstein
    Tomi Pastinen
    Clinical Epigenetics, 2015, 7
  • [25] Zipf Distribution Model for Quantifying Risk of Re-identification from Trajectory Data
    Kikuchi, Hiroaki
    Takahashi, Katsumi
    2015 THIRTEENTH ANNUAL CONFERENCE ON PRIVACY, SECURITY AND TRUST (PST), 2015, : 14 - 21
  • [26] Privacy Risk Evaluation of Re-identification of Pseudonyms
    Takeuchi, Yuma
    Kitajima, Shogo
    Fukushima, Kazuya
    Mambo, Masahiro
    2019 14TH ASIA JOINT CONFERENCE ON INFORMATION SECURITY (ASIAJCIS 2019), 2019, : 165 - 172
  • [27] A Re-identification Risk-based Anonymization Framework for Data Analytics Platforms
    Silva, Hebert
    Basso, Tania
    Moraes, Regina
    Elia, Donatello
    Fiore, Sandro
    2018 14TH EUROPEAN DEPENDABLE COMPUTING CONFERENCE (EDCC 2018), 2018, : 101 - 106
  • [28] Risk of re-identification of epigenetic methylation data: a more nuanced response is needed
    Joly, Yann
    Dyke, Stephanie O. M.
    Cheung, Warren A.
    Rothstein, Mark A.
    Pastinen, Tomi
    CLINICAL EPIGENETICS, 2015, 7
  • [29] Re-Identification Risk Based Security Controls
    Di Cerbo, Francesco
    Trabelsi, Slim
    ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS: OTM 2014 WORKSHOPS, 2014, 8842 : 99 - 107
  • [30] Re-identification potential of structured health data
    Drechsler, Joerg
    Pauly, Hannah
    BUNDESGESUNDHEITSBLATT-GESUNDHEITSFORSCHUNG-GESUNDHEITSSCHUTZ, 2024, 67 (02) : 164 - 170