Comparison of different feature extraction methods for applicable automated ICD coding

Cited by: 3
Authors
Zhao Shuai [1]
Diao Xiaolin [1]
Yuan Jing [2]
Huo Yanni [1]
Cui Meng [1]
Wang Yuxin [1]
Zhao Wei [2]
Affiliations
[1] Chinese Acad Med Sci & Peking Union Med Coll, Dept Informat Ctr, Fuwai Hosp, Beijing, Peoples R China
[2] Chinese Acad Med Sci & Peking Union Med Coll, Fuwai Hosp, Natl Ctr Cardiovasc Dis, Dept Informat Ctr, 167 Beilishi Rd, Beijing 100037, Peoples R China
Keywords
Automated ICD coding; Feature extraction; Bag-of-words; BERT; Word2vec; Interpretability; CLINICAL CODES;
DOI
10.1186/s12911-022-01753-5
Chinese Library Classification number
R-058
Subject classification code
Abstract
Background: Automated ICD coding of medical texts via machine learning has been a hot topic. Related studies in the medical field rely heavily on the conventional bag-of-words (BoW) model for feature extraction and rarely use more complex methods such as word2vec (W2V) or large pretrained models like BERT. This study aimed to identify the most effective feature extraction methods for coding models by comparing BoW, W2V and BERT variants.
Methods: We experimented with a Chinese dataset from Fuwai Hospital, which contains 6947 records and 1532 unique ICD codes, and a public Spanish dataset, which contains 1000 records and 2557 unique ICD codes. We designed coding tasks with different code frequency thresholds (denoted as f(s)), with a lower threshold indicating a more complex task. Using traditional classifiers, we compared BoW, W2V and BERT variants on these coding tasks.
Results: When f(s) was equal to or greater than 140 for the Fuwai dataset and 60 for the Spanish dataset, fine-tuning the whole network of the BERT variants was the best method, yielding a Micro-F1 of 93.9% for the Fuwai dataset when f(s) = 200 and a Micro-F1 of 85.41% for the Spanish dataset when f(s) = 180. When f(s) fell below 140 for the Fuwai dataset and 60 for the Spanish dataset, BoW was the best method, yielding a Micro-F1 of 83% for the Fuwai dataset when f(s) = 20 and a Micro-F1 of 39.1% for the Spanish dataset when f(s) = 20. Our experiments also showed that both the BERT variants and BoW had good interpretability, which is important for medical applications of coding models.
Conclusions: This study shed light on building promising machine learning models for automated ICD coding by revealing the most effective feature extraction methods. Concretely, our results indicated that fine-tuning the whole network of the BERT variants was the optimal method for tasks covering only frequent codes, especially codes that represented unspecified diseases, while BoW was the best for tasks involving both frequent and infrequent codes. The frequency threshold at which the best-performing method changed differed between datasets, owing to factors such as language and code set.
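The abstract does not say how the frequency threshold f(s) is applied or how Micro-F1 is computed, so the following is only a minimal sketch under stated assumptions: a code is kept when it is assigned to at least `threshold` records, records left without any kept code are dropped, and Micro-F1 is the standard multi-label definition pooled over all (record, code) pairs. The function names and the toy records are hypothetical and are not taken from the study.

```python
from collections import Counter


def filter_codes_by_frequency(records, threshold):
    """Keep only codes assigned to at least `threshold` records.

    `records` is a list of (text, codes) pairs. Dropping records that end up
    with no codes is an assumption of this sketch, not a documented step.
    """
    counts = Counter(code for _, codes in records for code in set(codes))
    frequent = {code for code, n in counts.items() if n >= threshold}
    filtered = []
    for text, codes in records:
        kept = sorted(set(codes) & frequent)
        if kept:
            filtered.append((text, kept))
    return filtered, frequent


def micro_f1(true_sets, pred_sets):
    """Multi-label Micro-F1: pool TP, FP and FN over all records and codes."""
    tp = fp = fn = 0
    for true_codes, pred_codes in zip(true_sets, pred_sets):
        t, p = set(true_codes), set(pred_codes)
        tp += len(t & p)   # codes predicted and correct
        fp += len(p - t)   # codes predicted but not assigned
        fn += len(t - p)   # codes assigned but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data for illustration only (not from either dataset).
toy_records = [
    ("note 1", ["I10", "I25.1"]),
    ("note 2", ["I10"]),
    ("note 3", ["I10", "E11.9"]),
    ("note 4", ["I25.1"]),
    ("note 5", ["J45"]),
]

filtered, frequent = filter_codes_by_frequency(toy_records, threshold=2)
print(sorted(frequent), len(filtered))

score = micro_f1([{"I10", "I25.1"}, {"I10"}], [{"I10"}, {"I10", "I25.1"}])
print(round(score, 4))
```

Lowering `threshold` in this sketch admits rarer codes into the label set, which matches the abstract's remark that a lower threshold makes the task more complex.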
Pages: 15
Related papers
50 in total
  • [21] A Comparison of Machine Learning Methods for the Diagnosis of Motor Faults Using Automated Spectral Feature Extraction Technique
    Muhammad Irfan
    Abdullah Saeed Alwadie
    Faisal AlThobiani
    Khurram Shehzad Quraishi
    Mohammed Jalalah
    Ali Abbass
    Saifur Rahman
    Mohammad Kamal Asif Khan
    Samar Alqhtani
    Journal of Nondestructive Evaluation, 2022, 41
  • [22] A Comparison of Machine Learning Methods for the Diagnosis of Motor Faults Using Automated Spectral Feature Extraction Technique
    Irfan, Muhammad
    Alwadie, Abdullah Saeed
    AlThobiani, Faisal
    Quraishi, Khurram Shehzad
    Jalalah, Mohammed
    Abbass, Ali
    Rahman, Saifur
    Khan, Mohammad Kamal Asif
    Alqhtani, Samar
    JOURNAL OF NONDESTRUCTIVE EVALUATION, 2022, 41 (02)
  • [23] Computing welding distortion: Comparison of different industrially applicable methods
    Tikhomirov, D.
    Rietman, B.
    Kose, K.
    Makkink, M.
    Sheet Metal 2005, 2005, 6-8 : 195 - 202
  • [24] Analysis of Different Image Enhancement and Feature Extraction Methods
    Veronica Lozano-Vazquez, Lucero
    Miura, Jun
    Jorge Rosales-Silva, Alberto
    Luviano-Juarez, Alberto
    Mujica-Vargas, Dante
    MATHEMATICS, 2022, 10 (14)
  • [25] Evaluation of feature extraction methods for different types of images
    Eman S. Sabry
    Salah S. Elagooz
    Fathi E. Abd El-Samie
    Nirmeen A. El-Bahnasawy
    Ghada M. El-Banby
    Rabie A. Ramadan
    Journal of Optics, 2023, 52 : 716 - 741
  • [26] Evaluation of feature extraction methods for different types of images
    Sabry, Eman S.
    Elagooz, Salah S.
    El-Samie, Fathi E. Abd
    El-Bahnasawy, Nirmeen A.
    El-Banby, Ghada M.
    Ramadan, Rabie A.
    JOURNAL OF OPTICS-INDIA, 2023, 52 (02): 716 - 741
  • [27] A Performance Comparison of Feature Extraction Methods for Sentiment Analysis
    Hung, Lai Po
    Alfred, Rayner
    ADVANCED TOPICS IN INTELLIGENT INFORMATION AND DATABASE SYSTEMS, 2017, 710 : 379 - 390
  • [28] Comparison of Feature Extraction Methods for EEG BCI Classification
    Uktveris, Tomas
    Jusas, Vacius
    INFORMATION AND SOFTWARE TECHNOLOGIES, ICIST 2015, 2015, 538 : 81 - 92
  • [29] A Study of Comparison of Feature Extraction Methods for Handwriting Recognition
    Gunawan, Fergyanto F.
    Hapsari, Intan A.
    Soewito, Benfano
    Candra, Sevenpri
    2016 INTERNATIONAL SEMINAR ON INTELLIGENT TECHNOLOGY AND ITS APPLICATIONS (ISITIA): RECENT TRENDS IN INTELLIGENT COMPUTATIONAL TECHNOLOGIES FOR SUSTAINABLE ENERGY, 2016: 73 - 78
  • [30] Comparison of feature extraction methods of vehicle vibration signal
    Liao, Qing-Bin
    Li, Shun-Ming
    Qin, Xiao-Pan
    Jilin Daxue Xuebao (Gongxueban)/Journal of Jilin University (Engineering and Technology Edition), 2007, 37 (04): 910 - 914