Comparison of different feature extraction methods for applicable automated ICD coding

Cited by: 3
Authors
Zhao Shuai [1]
Diao Xiaolin [1]
Yuan Jing [2]
Huo Yanni [1]
Cui Meng [1]
Wang Yuxin [1]
Zhao Wei [2]
Affiliations
[1] Chinese Acad Med Sci & Peking Union Med Coll, Dept Informat Ctr, Fuwai Hosp, Beijing, Peoples R China
[2] Chinese Acad Med Sci & Peking Union Med Coll, Fuwai Hosp, Natl Ctr Cardiovasc Dis, Dept Informat Ctr, 167 Beilishi Rd, Beijing 100037, Peoples R China
Keywords
Automated ICD coding; Feature extraction; Bag-of-words; BERT; Word2vec; Interpretability; Clinical codes
DOI
10.1186/s12911-022-01753-5
Chinese Library Classification number
R-058
Abstract
Background: Automated ICD coding of medical texts via machine learning has been a hot topic. Related studies in the medical field rely heavily on conventional bag-of-words (BoW) as the feature extraction method and do not commonly use more complex methods, such as word2vec (W2V) and large pretrained models like BERT. This study aimed to uncover the most effective feature extraction methods for coding models by comparing BoW, W2V and BERT variants.
Methods: We experimented with a Chinese dataset from Fuwai Hospital, which contains 6947 records and 1532 unique ICD codes, and a public Spanish dataset, which contains 1000 records and 2557 unique ICD codes. We designed coding tasks with different code frequency thresholds (denoted as f(s)), with a lower threshold indicating a more complex task. Using traditional classifiers, we compared BoW, W2V and BERT variants on these coding tasks.
Results: When f(s) was equal to or greater than 140 for the Fuwai dataset, and 60 for the Spanish dataset, fine-tuning the whole network of the BERT variants was the best method, leading to a Micro-F1 of 93.9% for the Fuwai dataset when f(s) = 200, and a Micro-F1 of 85.41% for the Spanish dataset when f(s) = 180. When f(s) fell below 140 for the Fuwai dataset, and 60 for the Spanish dataset, BoW turned out to be the best, leading to a Micro-F1 of 83% for the Fuwai dataset when f(s) = 20, and a Micro-F1 of 39.1% for the Spanish dataset when f(s) = 20. Our experiments also showed that both the BERT variants and BoW possessed good interpretability, which is important for medical applications of coding models.
Conclusions: This study sheds light on building promising machine learning models for automated ICD coding by revealing the most effective feature extraction methods. Concretely, our results indicated that fine-tuning the whole network of the BERT variants was the optimal method for tasks covering only frequent codes, especially codes that represented unspecified diseases, while BoW was the best for tasks involving both frequent and infrequent codes. The frequency threshold at which the best-performing method changed differed between datasets owing to factors such as language and code set.
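The abstract describes tasks defined by a code frequency threshold f(s) and scored with Micro-F1. The following is a minimal sketch of how such a setup could look, not the authors' implementation. The function names, the toy records, and the choice to drop records left without any code after filtering are all assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter


def filter_by_frequency(records, threshold):
    """Keep only codes assigned to at least `threshold` records.

    `records` is a list of (text, codes) pairs. Records left with no
    codes after filtering are dropped (an assumption of this sketch).
    """
    # Count each code once per record, even if it is listed twice.
    counts = Counter(code for _, codes in records for code in set(codes))
    kept = {code for code, n in counts.items() if n >= threshold}
    filtered = []
    for text, codes in records:
        remaining = sorted(set(codes) & kept)
        if remaining:
            filtered.append((text, remaining))
    return filtered, kept


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over multi-label predictions.

    True positives, false positives and false negatives are pooled
    across all records and codes before computing F1.
    """
    tp = fp = fn = 0
    for true, pred in zip(true_sets, pred_sets):
        true, pred = set(true), set(pred)
        tp += len(true & pred)
        fp += len(pred - true)
        fn += len(true - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: three records with ICD-10 style codes.
records = [
    ("chest pain, hypertension", ["I10", "R07"]),
    ("hypertension follow-up", ["I10"]),
    ("atrial fibrillation", ["I48"]),
]

# With a threshold of 2, only "I10" is frequent enough to keep.
filtered, kept = filter_by_frequency(records, threshold=2)

# Score hypothetical predictions against the filtered labels.
score = micro_f1([codes for _, codes in filtered], [["I10"], ["I10", "I48"]])
```

Lowering `threshold` admits rarer codes, which enlarges the label set and makes the task harder, consistent with the abstract's note that a lower threshold indicates a more complex task.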
Pages: 15
Source: BMC Medical Informatics and Decision Making, 22