Automatic De-Identification of Medical Records with a Multilevel Hybrid Semi-Supervised Learning Approach

Authors
Nguyen Dong Phuong [1]
Vo Thi Ngoc Chau [2]
Affiliations
[1] Vietnam Natl Univ, Ho Chi Minh City Univ Technol, Ton Duc Thang Univ, Ctr Appl Informat Technol, Ho Chi Minh City, Vietnam
[2] Vietnam Natl Univ, Ho Chi Minh City Univ Technol, Fac Comp Sci & Engn, Dept Informat Syst, Ho Chi Minh City, Vietnam
Keywords
de-identification; protected health information; electronic medical record; privacy preserving; multilevel hybrid semi-supervised learning; CLINICAL DOCUMENTS; SYSTEM
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In recent years, sharing electronic medical records (EMRs) with researchers outside the originating institutions has become increasingly important. To preserve the privacy of the patients and institutions concerned, the EMRs must be de-identified before they are shared. Although de-identification has been studied worldwide with positive outcomes, notably in the i2b2 (Informatics for Integrating Biology and the Bedside) shared tasks in 2006 and 2014, it is not yet a solved problem and needs further investigation under realistic conditions. In this paper, we propose an automatic de-identification solution in a multilevel hybrid semi-supervised learning paradigm, with a key focus on correctly identifying protected health information (PHI) in EMRs. Like existing work, our approach is hybrid: it combines a machine learning-based method using a conditional random fields model with a rule-based method in a post-processing phase that handles ambiguous PHI types. However, our work is more general and practical. First, it considers the structural complexity of each EMR, so that each section is treated according to whether it is structured, semi-structured, or unstructured, for more accurate PHI identification. Second, each EMR is examined at three levels of granularity: the token level in the supervised learning phase, the entity level in the rule-based post-processing phase, and the section level, along with the structural complexity, in the semi-supervised learning phase. Working at these multiple levels of detail gives our approach a deeper view of each EMR and makes it more effective. Third, our solution is carried out in a self-training manner, so that in practice it can start with a small annotated data set and become more effective as new EMRs arrive over time.
Evaluated on the i2b2 data set against related work, our solution achieves better F-measure values for the AGE, LOCATION, and PHONE PHI types and comparable values for the other PHI types.
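The self-training loop with rule-based post-processing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: a toy frequency-based tagger (ToyTagger, a hypothetical name) stands in for the conditional random fields model, sentences stand in for EMR sections, and the PHONE_RE pattern, the 0.9 confidence threshold, and the sample data are all invented for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical rule for the post-processing phase (US-style phone token).
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")


class ToyTagger:
    """Stand-in for the CRF: remembers the most frequent label per token."""

    def fit(self, sentences):
        counts = defaultdict(Counter)
        for sentence in sentences:
            for token, label in sentence:
                counts[token.lower()][label] += 1
        self.counts = counts
        return self

    def predict(self, tokens):
        """Return one (label, confidence) pair per token."""
        result = []
        for token in tokens:
            seen = self.counts.get(token.lower())
            if not seen:
                result.append(("O", 0.0))  # unseen token: no evidence
                continue
            label, n = seen.most_common(1)[0]
            result.append((label, n / sum(seen.values())))
        return result


def apply_rules(tokens, labels):
    """Rule-based post-processing: override a label where a pattern matches."""
    return ["PHONE" if PHONE_RE.match(t) else l for t, l in zip(tokens, labels)]


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Grow the training set with confidently tagged unlabeled sentences."""
    labeled, pool = list(labeled), list(unlabeled)
    tagger = ToyTagger().fit(labeled)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for tokens in pool:
            preds = tagger.predict(tokens)
            labels = apply_rules(tokens, [label for label, _ in preds])
            # Tokens matched by a rule are treated as fully confident.
            confs = [1.0 if PHONE_RE.match(t) else c
                     for t, (_, c) in zip(tokens, preds)]
            if min(confs) >= threshold:
                labeled.append(list(zip(tokens, labels)))
                added += 1
            else:
                remaining.append(tokens)
        pool = remaining
        if added == 0:
            break  # nothing new was learned in this round
        tagger = ToyTagger().fit(labeled)
    return tagger, pool


# Invented sample data: a small annotated set and two unlabeled sentences.
labeled = [
    [("Patient", "O"), ("is", "O"), ("aged", "O"), ("45", "AGE")],
    [("Call", "O"), ("the", "O"), ("clinic", "O")],
]
unlabeled = [
    ["Call", "the", "clinic", "555-123-4567"],
    ["Seen", "in", "Boston"],
]
tagger, still_unlabeled = self_train(labeled, unlabeled)
```

In this sketch the first unlabeled sentence is absorbed into the training set because the rule resolves its only unknown token, while the second stays in the pool because the tagger has no evidence for its tokens. A real system would replace the toy tagger with a sequence model that reports calibrated confidence.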
Pages: 43-48
Page count: 6