An hybrid Machine Learning method for the de-identification of Un-Structured Narrative Clinical Text in Multi-Center Chinese Electronic Medical Records Data

Authors
Jin, Meng [1 ,2 ]
Zhang, Kai [3 ]
Yang, Yunhaonan [4 ]
Xie, Shuanglian [5 ]
Song, Kai [6 ]
Hu, Yonghua [1 ,4 ]
Bao, Xiaoyuan [1 ,2 ]
Affiliations
[1] Peking Univ, Med Informat Ctr, Beijing, Peoples R China
[2] Natl Med Serv Data Ctr, Beijing, Peoples R China
[3] Peking Univ, Hlth Sci Ctr, Beijing, Peoples R China
[4] Peking Univ, Sch Publ Hlth, Beijing, Peoples R China
[5] Peking Univ, Clin Med Coll 5, Beijing, Peoples R China
[6] China Japan Friendship Hosp, Beijing, Peoples R China
Keywords
component; Chinese electronic medical record; Un-structured; machine learning; corpora; multi-center; OF-THE-ART; ANONYMIZATION; INFORMATION;
DOI
10.1109/ICBK.2019.00023
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Making full use of unstructured electronic medical records requires that patients' private information be fully protected. Identifying and removing information that could be used to identify a patient, before the electronic medical record data are processed, is therefore an active research topic. Few de-identification methods exist for Chinese electronic medical records, and their cross-center performance is poor. We therefore developed a de-identification method that combines rule-based and machine learning methods. The method was tested on 700 electronic medical records from six hospitals. Five-fold cross-validation was used to evaluate C5.0, Random Forest, SVM, and XGBoost; leave-one-out testing was used to evaluate CRF. The F1 measure of the machine learning methods reached 91.18% for PHI_Names, 98.21% for PHI_MEDICALID, 95.74% for PHI_OTHERNFC, 97.14% for PHI_GEO, 89.19% for PHI_DATES, and 91.49% for PHI_TEL. The F1 measure of the rule-based methods reached 93.00% for PHI_Names, 97.00% for PHI_MEDICALID, 97.00% for PHI_OTHERNFC, 97.00% for PHI_GEO, 96.00% for PHI_DATES, and 89.00% for PHI_TEL.
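The abstract does not give the rules or the scoring procedure, so the following is only a minimal sketch of what a rule-based pass and an exact-match F1 computation might look like. The regular expressions, the function names (`find_phi`, `f1_score`, `redact`), and the sample sentence are all hypothetical illustrations, not the authors' method; only the category labels PHI_DATES and PHI_TEL are taken from the abstract.

```python
import re

# Hypothetical rules for two PHI categories named in the abstract.
# The paper's actual rule set is not described here; these are assumptions.
RULES = {
    # Dates such as 2019年3月5日 or 2019-3-5
    "PHI_DATES": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日|\d{4}-\d{1,2}-\d{1,2}"),
    # 11-digit mainland China mobile numbers, not embedded in a longer digit run
    "PHI_TEL": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
}


def find_phi(text):
    """Return the set of (category, start, end) spans matched by the rules."""
    spans = set()
    for category, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.add((category, match.start(), match.end()))
    return spans


def f1_score(predicted, gold):
    """Exact-match span F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def redact(text, spans):
    """Replace each span with a [CATEGORY] tag, working right to left
    so that earlier offsets stay valid."""
    for category, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        text = text[:start] + f"[{category}]" + text[end:]
    return text


sample = "患者于2019年3月5日入院，联系电话13812345678。"
found = find_phi(sample)
print(redact(sample, found))
```

In a hybrid setup of the kind the abstract describes, spans from rules like these would presumably be combined with spans predicted by a trained classifier or CRF; how the two sources are merged is not stated in the abstract.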
Pages: 105-111
Page count: 7