A unified framework for evaluating the risk of re-identification of text de-identification tools

Cited by: 11
Authors
Scaiano, Martin [1, 2]
Middleton, Grant [2]
Arbuckle, Luk [4]
Kolhatkar, Varada [1, 2]
Peyton, Liam [1]
Dowling, Moira [5]
Gipson, Debbie S. [6]
El Emam, Khaled [1, 2, 3, 4]
Affiliations
[1] Univ Ottawa, Sch Elect Engn & Comp Sci, Ottawa, ON, Canada
[2] Privacy Analyt Inc, Ottawa, ON, Canada
[3] Univ Ottawa, Dept Pediat, Ottawa, ON, Canada
[4] Eastern Ontario Res Inst, Childrens Hosp, Ottawa, ON, Canada
[5] Univ Michigan, Sch Med, Res Off, Michigan Inst Data Sci MIDAS, Ann Arbor, MI 48109 USA
[6] Univ Michigan, Dept Pediat, Ann Arbor, MI 48109 USA
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
De-identification; Re-identification risk; Medical text; Evaluation framework; Natural language processing; Data sharing;
DOI
10.1016/j.jbi.2016.07.015
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Objectives: It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine if these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases.
Methods: We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall in order to illustrate the difference between the proposed evaluation framework and the current baseline method.
Results: We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. For the probability of re-identification using our evaluation framework we obtained a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release following commonly used de-identification criteria for structured data.
Discussion: Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification.
Conclusions: This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools. © 2016 The Authors. Published by Elsevier Inc.
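The abstract contrasts the proposed framework with a micro-average evaluation of recall. The sketch below is not the authors' framework; it only illustrates, on invented toy counts, why a pooled (micro-averaged) recall can look reassuring when the evaluation corpus is poorly distributed. The `DocCounts` type, the function names, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DocCounts:
    """Per-document annotation counts (hypothetical structure)."""
    found: int   # identifiers the tool detected (true positives)
    missed: int  # identifiers the tool left in the text (false negatives)


def micro_recall(docs):
    """Pool all identifiers across documents, then compute recall once."""
    tp = sum(d.found for d in docs)
    fn = sum(d.missed for d in docs)
    return tp / (tp + fn)


def macro_recall(docs):
    """Compute recall per document, then average over documents."""
    per_doc = [
        d.found / (d.found + d.missed)
        for d in docs
        if d.found + d.missed > 0
    ]
    return sum(per_doc) / len(per_doc)


def share_of_docs_with_leak(docs):
    """Fraction of documents in which at least one identifier was missed."""
    return sum(1 for d in docs if d.missed > 0) / len(docs)


# Invented toy corpus: one identifier-dense document dominates the pool.
docs = [
    DocCounts(found=97, missed=3),
    DocCounts(found=1, missed=1),
    DocCounts(found=2, missed=0),
    DocCounts(found=0, missed=1),
]

print(f"micro-averaged recall: {micro_recall(docs):.3f}")
print(f"macro-averaged recall: {macro_recall(docs):.3f}")
print(f"documents with a leak: {share_of_docs_with_leak(docs):.2f}")
```

In this toy corpus the pooled recall is dominated by the single identifier-dense document, while most documents still contain at least one missed identifier. This is one simple way to see the gap between a recall score and per-patient exposure; the paper's own risk model additionally accounts for the data release context, which this sketch does not attempt.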
Pages: 174-183
Page count: 10