Automatic De-Identification of Medical Records with a Multilevel Hybrid Semi-Supervised Learning Approach

Authors
Nguyen Dong Phuong [1]
Vo Thi Ngoc Chau [2]
Affiliations
[1] Vietnam Natl Univ, Ho Chi Minh City Univ Technol, Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam
[2] Vietnam Natl Univ, Ho Chi Minh City Univ Technol, Fac Comp Sci & Engn, Dept Informat Syst, Ho Chi Minh City, Vietnam
Keywords
de-identification; protected health information; electronic medical record; privacy preserving; multilevel hybrid semi-supervised learning; CLINICAL DOCUMENTS; SYSTEM
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, sharing electronic medical records (EMRs) with researchers outside the originating institutions has become increasingly important. To preserve the privacy of the patients and institutions concerned, the EMRs to be shared must first be de-identified. Although de-identification has been studied worldwide with positive results, notably in the i2b2 (Informatics for Integrating Biology and the Bedside) shared tasks of 2006 and 2014, it is not yet a solved problem and needs further investigation under realistic conditions. In this paper, we propose an automatic de-identification solution in a multilevel hybrid semi-supervised learning paradigm, with a key focus on correctly identifying protected health information (PHI) in EMRs. Like existing work, our approach is hybrid: it combines a machine learning-based method using a conditional random fields model with a rule-based method in a post-processing phase that handles ambiguous PHI types. However, our approach is more general and practical. First, it takes into account the structural complexity of each EMR section (structured, semi-structured, or unstructured) so that each section is treated appropriately for more accurate PHI identification. Second, each EMR is examined at three levels of granularity: the token level in the supervised learning phase, the entity level in the rule-based post-processing phase, and the section level, together with its structural complexity, in the semi-supervised learning phase. These multiple levels of detail give our approach a deeper look at each EMR and make it more effective. Third, our solution is trained in a self-training manner, so that in practice it can start from a small annotated data set and become more effective as new EMRs arrive over time.
Evaluated on the i2b2 data set against related work, our solution achieves better F-measure values for the AGE, LOCATION, and PHONE PHI types and comparable values for the other PHI types.
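The pipeline outlined in the abstract (token-level tagging, rule-based post-processing, self-training on unlabeled records) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the conditional random fields model is replaced here by a toy count-based tagger so that the example runs with the standard library alone, section-level structure is ignored, and the names `ToyTagger`, `apply_rules`, `self_train`, `deidentify`, the phone pattern, and the 0.9 confidence threshold are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical post-processing rule: a simple US-style phone number pattern.
PHONE_RE = re.compile(r"^\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}$")


def shape(token):
    """Coarse token shape, e.g. 'Smith' -> 'Xx', '555-1234' -> 'd-d'."""
    s = re.sub(r"[A-Z]+", "X", token)
    s = re.sub(r"[a-z]+", "x", s)
    return re.sub(r"\d+", "d", s)


class ToyTagger:
    """Stand-in for the CRF: label counts keyed by (previous token, shape)."""

    def fit(self, sentences):
        self.by_context = defaultdict(Counter)
        self.by_shape = defaultdict(Counter)
        for tokens, labels in sentences:
            prev = "<s>"
            for tok, lab in zip(tokens, labels):
                self.by_context[(prev, shape(tok))][lab] += 1
                self.by_shape[shape(tok)][lab] += 1
                prev = tok.lower()
        return self

    def predict(self, tokens):
        """Return one (label, confidence) pair per token."""
        out, prev = [], "<s>"
        for tok in tokens:
            # Back off from the context key to the shape alone when unseen.
            counts = (self.by_context.get((prev, shape(tok)))
                      or self.by_shape.get(shape(tok)))
            if counts:
                label, n = counts.most_common(1)[0]
                out.append((label, n / sum(counts.values())))
            else:
                out.append(("O", 0.0))
            prev = tok.lower()
        return out


def apply_rules(tokens, labels):
    """Rule-based post-processing: override labels where a rule fires."""
    return ["PHONE" if PHONE_RE.match(t) else l for t, l in zip(tokens, labels)]


def self_train(labeled, unlabeled, rounds=3, threshold=0.9):
    """Self-training: adopt confidently tagged sentences as new training data."""
    train, pool = list(labeled), list(unlabeled)
    tagger = ToyTagger().fit(train)
    for _ in range(rounds):
        remaining = []
        for tokens in pool:
            preds = tagger.predict(tokens)
            if preds and min(conf for _, conf in preds) >= threshold:
                labels = apply_rules(tokens, [lab for lab, _ in preds])
                train.append((tokens, labels))
            else:
                remaining.append(tokens)
        if len(remaining) == len(pool):  # nothing new was adopted; stop early
            break
        pool = remaining
        tagger = ToyTagger().fit(train)
    return tagger


def deidentify(tagger, tokens):
    """Replace every token tagged as PHI with a placeholder such as [PHONE]."""
    labels = apply_rules(tokens, [lab for lab, _ in tagger.predict(tokens)])
    return [tok if lab == "O" else f"[{lab}]" for tok, lab in zip(tokens, labels)]


labeled = [(["Dr", "Smith", "saw", "him"], ["O", "DOCTOR", "O", "O"])]
unlabeled = [["Dr", "Jones", "saw", "him"]]
tagger = self_train(labeled, unlabeled)
print(deidentify(tagger, ["Dr", "Lee", "called", "555-123-4567"]))
```

In this sketch the phone number is caught only by the rule, since the toy tagger has never seen a token of that shape; this mirrors the role the abstract assigns to the post-processing phase. A real system would swap `ToyTagger` for a sequence model with calibrated confidence scores.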
Pages: 43-48
Number of pages: 6