On Anonymizing Medical Microdata with Large-Scale Missing Values - A Case Study with the FAERS Dataset

Authors
Hsiao, Mei-Hui [1]
Lin, Wen-Yang [1]
Hsu, Kuang-Yung [1]
Shen, Zih-Xun [1]
Affiliation
[1] Natl Univ Kaohsiung, Dept Comp Sci & Informat Engn, Kaohsiung, Taiwan
DOI
10.1109/embc.2019.8857025
Chinese Library Classification (CLC) number
R318 [Biomedical Engineering]
Discipline classification code
0831
Abstract
As big data analysis becomes one of the main drivers of productivity and economic growth, concern about the disclosure of individual privacy grows as well, especially for applications that access medical or health data containing personal information. Most contemporary techniques for privacy-preserving data publishing rest on a simple assumption: the data of concern are complete, i.e., contain no missing values. This is rarely the case in the real world. This paper presents our investigation of the effect of missing values on medical data privacy. In particular, we examined the US FAERS dataset, a public dataset of adverse drug events released by the US FDA. Following the presumption of the current anonymization paradigm that the data contain no missing values, we investigated three intuitive strategies for anonymizing the FAERS dataset: including missing values, excluding them, or performing imputation. Our results demonstrate how poorly these intuitive strategies handle data with a massive number of missing values. Accordingly, we propose a new strategy, consolidation, together with a corresponding privacy protection model and anonymization algorithm. Experimental results show that our method can prevent privacy disclosure while sustaining data utility for adverse drug reaction (ADR) signal detection.
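The three intuitive strategies named in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy records, the column choice, the function names, and the use of mode imputation are hypothetical and are not taken from the paper or from FAERS. The consolidation strategy is not shown because the abstract does not describe it.

```python
from collections import Counter

# Hypothetical toy records (age_group, sex, country); None marks a missing value.
# These are illustrative only and are not FAERS data.
records = [
    ("30-39", "F", "US"),
    ("30-39", "F", None),
    ("30-39", None, "US"),
    ("40-49", "M", "US"),
    ("40-49", "M", "US"),
    (None, "M", "JP"),
]

def include_missing(rows):
    """Keep every record and treat 'missing' as an ordinary attribute value."""
    return [tuple("MISSING" if v is None else v for v in r) for r in rows]

def exclude_missing(rows):
    """Drop every record that has at least one missing value."""
    return [r for r in rows if None not in r]

def impute_mode(rows):
    """Fill each missing value with the most frequent observed value of its column."""
    modes = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [tuple(m if v is None else v for v, m in zip(r, modes)) for r in rows]

def smallest_group(rows):
    """Size of the smallest group of identical records, i.e. the k actually achieved."""
    return min(Counter(rows).values()) if rows else 0

for name, strategy in [("include", include_missing),
                       ("exclude", exclude_missing),
                       ("impute", impute_mode)]:
    out = strategy(records)
    print(name, "records kept:", len(out), "smallest group:", smallest_group(out))
```

On this toy input, exclusion discards half of the records, while inclusion and imputation keep all six but still leave groups of size one. This is only a small-scale hint of the difficulties the abstract reports for data with many missing values.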
Pages: 6505-6508
Number of pages: 4