Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

Cited by: 0
Authors
Im, Eunyoung [1, 2]
Kim, Hyeoneui [1, 2, 3]
Lee, Hyungbok [1, 5]
Jiang, Xiaoqian [4]
Kim, Ju Han [5, 6]
Affiliations
[1] Seoul Natl Univ, Coll Nursing, Seoul, South Korea
[2] Seoul Natl Univ, Coll Nursing, Ctr World Leading Human Care Nurse Leaders Future, Seoul, South Korea
[3] Seoul Natl Univ, Res Inst Nursing Sci, Seoul, South Korea
[4] UTHealth, Sch Biomed Informat, Houston, TX USA
[5] Seoul Natl Univ Hosp, Seoul, South Korea
[6] Seoul Natl Univ, Coll Med, Seoul, South Korea
Keywords
Data privacy; Data utility; Data de-identification; Clinical data analysis; ARX tool; K-ANONYMITY; BLOCKCHAIN; QUERIES;
DOI
10.1186/s12911-024-02545-9
Chinese Library Classification number
R-058
Abstract
Background: Securing adequate data privacy is critical for the productive utilization of data. De-identification, which involves masking or replacing specific values in a dataset, can damage the dataset's utility, and finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies have investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility.

Methods: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software tool for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two.

Results: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores.

Conclusions: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data's intended use and obtaining input from data users. This approach could help find a suitable compromise between data privacy and utility.
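
The abstract notes that de-identification lowered re-identification risk but also caused record suppression. The toy Python sketch below illustrates that general mechanism with k-anonymity (one of the listed keywords). It is not ARX and not the study's configuration: the records, the helper names (`generalize_age`, `k_anonymize`, `max_reidentification_risk`), the 10-year age bands, and k = 2 are all assumptions chosen for illustration.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39' (hypothetical rule)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def _key(record, quasi_identifiers):
    return tuple(record[q] for q in quasi_identifiers)

def k_anonymize(records, quasi_identifiers, k):
    """Suppress records whose quasi-identifier combination occurs fewer than k times."""
    counts = Counter(_key(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[_key(r, quasi_identifiers)] >= k]

def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case risk: 1 / size of the smallest group sharing quasi-identifiers."""
    counts = Counter(_key(r, quasi_identifiers) for r in records)
    return 1 / min(counts.values()) if counts else 0.0

# Invented toy records; 'long_stay' stands in for an outcome variable.
original = [
    {"age": 34, "sex": "F", "long_stay": 0},
    {"age": 37, "sex": "F", "long_stay": 1},
    {"age": 52, "sex": "M", "long_stay": 1},
    {"age": 58, "sex": "M", "long_stay": 0},
    {"age": 71, "sex": "F", "long_stay": 1},
]
qi = ["age", "sex"]

# Step 1: generalize the age; step 2: suppress groups smaller than k.
generalized = [{**r, "age": generalize_age(r["age"])} for r in original]
released = k_anonymize(generalized, qi, k=2)

risk_before = max_reidentification_risk(original, qi)
risk_after = max_reidentification_risk(released, qi)
suppression_rate = 1 - len(released) / len(original)
print(risk_before, risk_after, suppression_rate)
```

In this toy case the worst-case risk falls because every released record shares its quasi-identifiers with at least one other record, but one record is dropped and exact ages are lost, which is the kind of utility cost the abstract describes.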
Pages: 13
Related papers (50 in total)
  • [1] On the Tradeoff Between Privacy and Utility in Data Publishing
    Li, Tiancheng
    Li, Ninghui
    [J]. KDD-09: 15TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2009 : 517 - 525
  • [2] On the Tradeoff between Data-privacy and Utility for Data Publishing
    Liao, Wenjing
    He, Jianping
    Zhu, Shanying
    Chen, Cailian
    Guan, Xinping
    [J]. 2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018), 2018 : 779 - 786
  • [3] Granular data representation under privacy protection: Tradeoff between data utility and privacy via information granularity
    Zhang, Ge
    Zhu, Xiubin
    Yin, Li
    Pedrycz, Witold
    Li, Zhiwu
    [J]. APPLIED SOFT COMPUTING, 2022, 131
  • [4] Exploring the Tradeoff Between Privacy and Utility of Complete-count Census Data Using a Multiobjective Optimization Approach
    Lin, Yue
    Xiao, Ningchuan
    [J]. GEOGRAPHICAL ANALYSIS, 2024, 56 (03) : 427 - 450
  • [5] The Risk-Utility Tradeoff for Data Privacy Models
    Almasi, M. Moein
    Siddiqui, Taha R.
    Mohammed, Noman
    Hemmati, Hadi
    [J]. 2016 8TH IFIP INTERNATIONAL CONFERENCE ON NEW TECHNOLOGIES, MOBILITY AND SECURITY (NTMS), 2016
  • [6] ANALYSIS OF THE BALANCE BETWEEN PRIVACY AND UTILITY IN DATA ACCESS
    Affonso, Elaine Parra
    de Oliveira, Sandra Cristina
    Goncalves Sant'Ana, Ricardo Cesar
    [J]. INFORMACAO & SOCIEDADE-ESTUDOS, 2017, 27 (01) : 81 - 92
  • [7] Guidance on the usability-privacy tradeoff for utility customer data aggregation
    Ruddell, Benjamin L.
    Cheng, Dan
    Fournier, Eric Daniel
    Pincetl, Stephanie
    Potter, Caryn
    Rushforth, Richard
    [J]. UTILITIES POLICY, 2020, 67
  • [8] Utility-Privacy Tradeoff Based on Random Data Obfuscation in Internet of Energy
    Guan, Zhitao
    Si, Guanlin
    Wu, Jun
    Zhu, Liehuang
    Zhang, Zijian
    Ma, Yinglong
    [J]. IEEE ACCESS, 2017, 5 : 3250 - 3262
  • [9] Synthetic data use: exploring use cases to optimise data utility
    James, Stefanie
    Harbron, Chris
    Branson, Janice
    Sundler, Mimmi
    [J]. Discover Artificial Intelligence, 2021, 1 (01)
  • [10] Tight Analysis of Privacy and Utility Tradeoff in Approximate Differential Privacy
    Geng, Quan
    Ding, Wei
    Guo, Ruiqi
    Kumar, Sanjiv
    [J]. INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 108, 2020, 108 : 89 - 98