A unified framework for evaluating the risk of re-identification of text de-identification tools

Cited by: 11
Authors
Scaiano, Martin [1, 2]
Middleton, Grant [2]
Arbuckle, Luk [4]
Kolhatkar, Varada [1, 2]
Peyton, Liam [1]
Dowling, Moira [5]
Gipson, Debbie S. [6]
El Emam, Khaled [1, 2, 3, 4]
Affiliations
[1] Univ Ottawa, Sch Elect Engn & Comp Sci, Ottawa, ON, Canada
[2] Privacy Analyt Inc, Ottawa, ON, Canada
[3] Univ Ottawa, Dept Pediat, Ottawa, ON, Canada
[4] Eastern Ontario Res Inst, Childrens Hosp, Ottawa, ON, Canada
[5] Univ Michigan, Sch Med, Res Off, Michigan Inst Data Sci MIDAS, Ann Arbor, MI 48109 USA
[6] Univ Michigan, Dept Pediat, Ann Arbor, MI 48109 USA
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
De-identification; Re-identification risk; Medical text; Evaluation framework; Natural language processing; Data sharing;
DOI
10.1016/j.jbi.2016.07.015
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Objectives: It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine if these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases.
Methods: We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall in order to illustrate the difference between the proposed evaluation framework and the current baseline method.
Results: We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. For the probability of re-identification using our evaluation framework we obtained a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release following commonly used de-identification criteria for structured data.
Discussion: Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification.
Conclusions: This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools. (C) 2016 The Authors. Published by Elsevier Inc.
Pages: 174-183
Page count: 10