Automatic de-identification of textual documents in the electronic health record: a review of recent research

Cited by: 182
Authors
Meystre, Stephane M. [1 ]
Friedlin, F. Jeffrey [3 ]
South, Brett R. [1 ,2 ]
Shen, Shuying [1 ,2 ]
Samore, Matthew H. [1 ,2 ]
Affiliations
[1] Univ Utah, Dept Biomed Informat, Salt Lake City, UT 84112 USA
[2] IDEAS Ctr SLCVA Healthcare Syst, Salt Lake City, UT USA
[3] Regenstrief Inst Inc, Med Informat, Indianapolis, IN USA
Source
Keywords
OF-THE-ART; MEDICAL-RECORDS; CLINICAL DOCUMENTS;
DOI
10.1186/1471-2288-10-70
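Chinese Library Classification number
R19 [Health care organizations and services (health services administration)]
Subject classification code
Abstract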
Background: In the United States, the Health Insurance Portability and Accountability Act (HIPAA) protects the confidentiality of patient data and requires the informed consent of the patient and approval of the Institutional Review Board to use data for research purposes, but these requirements can be waived if the data is de-identified. For clinical data to be considered de-identified, the HIPAA "Safe Harbor" technique requires 18 data elements (called PHI: Protected Health Information) to be removed. The de-identification of narrative text documents is often performed manually and requires significant resources. Well aware of these issues, several authors have investigated automated de-identification of narrative text documents from the electronic health record, and a review of recent research in this domain is presented here.

Methods: This review focuses on recently published research (after 1995), and includes relevant publications from bibliographic queries in PubMed, conference proceedings, and the ACM Digital Library, as well as relevant publications referenced in already included papers.

Results: The literature search returned more than 200 publications. The majority focused only on structured data de-identification instead of narrative text, on image de-identification, or described manual de-identification, and were therefore excluded. Finally, 18 publications describing automated text de-identification were selected for detailed analysis of the architecture and methods used, the types of PHI detected and removed, the external resources used, and the types of clinical documents targeted. All text de-identification systems aimed to identify and remove person names, and many included other types of PHI. Most systems targeted only one or two specific clinical document types, and were mostly based on two different groups of methodologies: pattern matching and machine learning. Many systems combined both approaches for different types of PHI, but the majority relied only on pattern matching, rules, and dictionaries.

Conclusions: In general, methods based on dictionaries performed better with PHI that is rarely mentioned in clinical text, but are more difficult to generalize. Methods based on machine learning tend to perform better, especially with PHI that is not mentioned in the dictionaries used. Finally, the issues of anonymization, sufficient performance, and "over-scrubbing" are discussed in this publication.
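As an illustration of the pattern-matching and dictionary approach described in the abstract, the following is a minimal Python sketch. It is not taken from any of the reviewed systems: the function name `deidentify`, the three regular expressions, and the tiny name dictionary are all hypothetical and cover only a small part of the 18 Safe Harbor elements.

```python
import re

# Hypothetical name dictionary (assumption); real systems use far larger lists.
NAME_DICTIONARY = {"smith", "garcia", "nguyen"}

# Illustrative patterns for a few PHI types (assumption; not a complete set).
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def deidentify(text):
    """Replace pattern-matched PHI and dictionary names with category tags."""
    # Rule-based pass: substitute each regular-expression match with its label.
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)

    # Dictionary pass: replace any alphabetic token found in the name list.
    def replace_name(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in NAME_DICTIONARY else word

    return re.sub(r"\b[A-Za-z]+\b", replace_name, text)


print(deidentify("Mr. Smith was seen on 03/14/2009; call 801-555-0100."))
# prints: Mr. [NAME] was seen on [DATE]; call [PHONE].
```

A name absent from the dictionary (for example "Mr. Okafor") would pass through unchanged in this sketch, which is consistent with the abstract's remark that machine learning methods tend to do better on PHI not covered by the dictionaries used.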
Pages: 16
Related papers
50 in total
  • [21] A hybrid approach to automatic de-identification of psychiatric notes
    Lee, Hee-Jin
    Wu, Yonghui
    Zhang, Yaoyun
    Xu, Jun
    Xu, Hua
    Roberts, Kirk
    JOURNAL OF BIOMEDICAL INFORMATICS, 2017, 75 : S19 - S27
  • [22] Evaluating the state-of-the-art in automatic de-identification
    Uzuner, Oezlem
    Luo, Yuan
    Szolovits, Peter
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2007, 14 (05) : 550 - 563
  • [23] Automatic Classification of Research Documents using Textual Entailment
    Ojokoh, Bolanle Adefowoke
    Omisore, Olatunji Mumini
    Samuel, Oluwarotimi Williams
    PROCEEDINGS OF THE 15TH ACM/IEEE-CS JOINT CONFERENCE ON DIGITAL LIBRARIES (JCDL'15), 2015, : 251 - 252
  • [24] Is Multiclass Automatic Text De-Identification Worth the Effort?
    Duy Duc An Bui
    Redden, David T.
    Cimino, James J.
    METHODS OF INFORMATION IN MEDICINE, 2018, 57 (04) : 177 - 184
  • [25] A review of Automatic end-to-end De-Identification: Is High Accuracy the Only Metric?
    Yogarajan, Vithya
    Pfahringer, Bernhard
    Mayo, Michael
    APPLIED ARTIFICIAL INTELLIGENCE, 2020, 34 (03) : 251 - 269
  • [26] In the Name of Fairness: Assessing the Bias in Clinical Record De-identification
    Xiao, Yuxin
    Lim, Shulammite
    Pollard, Tom Joseph
    Ghassemi, Marzyeh
    PROCEEDINGS OF THE 6TH ACM CONFERENCE ON FAIRNESS, ACCOUNTABILITY, AND TRANSPARENCY, FACCT 2023, 2023, : 123 - 137
  • [28] A Review of the Role of Electronic Health Record in Genomic Research
    Krishnamoorthy, Parasuram
    Gupta, Deepansh
    Chatterjee, Saurav
    Huston, Jessica
    Ryan, John J.
    JOURNAL OF CARDIOVASCULAR TRANSLATIONAL RESEARCH, 2014, 7 (08) : 692 - 700
  • [29] Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models
    El Azzouzi, Mohamed
    Coatrieux, Gouenou
    Bellafqira, Reda
    Delamarre, Denis
    Riou, Christine
    Oubenali, Naima
    Cabon, Sandie
    Cuggia, Marc
    Bouzille, Guillaume
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2024, 24 (01)