Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification

Cited by: 5
Authors
Kashina, M. [1]
Lenivtceva, I. D. [1]
Kopanitsa, G. D. [1]
Affiliation
[1] ITMO Univ, St Petersburg, Russia
Keywords
preprocessing; tokenization; classifier; medical text; natural language processing; allergy
DOI
10.1016/j.procs.2020.11.030
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Developing methods for processing data in Russian, in particular medical texts, remains important. In this paper, we examine how each stage of text preprocessing affects classifier performance. We analyzed 269,923 records of patients' allergy anamnesis, of which 11,670 were selected for further processing. We consider the main preprocessing stages: tokenization, stop-word removal, error correction, document cropping, normalization, class harmonization, and vectorization. Bag-of-Words was selected for vectorization. Logistic regression was chosen for classification because it is easy to reproduce and interpret. Precision, recall, and F-measure were used as evaluation metrics. The results (F = 88.12%) showed that normalization and error correction were the most effective stages. © 2020 The Authors. Published by Elsevier B.V.
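As a rough illustration of three of the steps named in the abstract (tokenization, stop-word removal, and Bag-of-Words vectorization) together with the evaluation metrics, the sketch below uses only the Python standard library. It is not the authors' code. The stop-word list, the English example sentences, and all function names are assumptions made for illustration. The original work used Russian text, and the error correction, normalization, class harmonization, and logistic regression steps are omitted here.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "to", "of", "is", "and"}


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(docs):
    """Return a sorted vocabulary and one token-count vector per document."""
    tokenized = [remove_stop_words(tokenize(d)) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for token, count in Counter(toks).items():
            vec[index[token]] = count
        vectors.append(vec)
    return vocab, vectors


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F-measure for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example records, not taken from the paper's data set.
docs = [
    "Allergy to penicillin",
    "No known allergy",
    "Rash after penicillin and aspirin",
]
vocab, vectors = bag_of_words(docs)
```

The resulting count vectors could then be passed to any classifier. The abstract names logistic regression, which is left out here to keep the sketch dependency-free.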
Pages: 284-290
Number of pages: 7
Related papers
50 in total
  • [1] Impact of preprocessing on medical data classification
    Sarab ALMUHAIDEB
    Mohamed El Bachir MENAI
    Frontiers of Computer Science, 2016, 10 (06) : 1082 - 1102
  • [2] Impact of preprocessing on medical data classification
    Sarab Almuhaideb
    Mohamed El Bachir Menai
    Frontiers of Computer Science, 2016, 10 : 1082 - 1102
  • [3] Impact of preprocessing on medical data classification
    Almuhaideb, Sarab
    Menai, Mohamed El Bachir
    FRONTIERS OF COMPUTER SCIENCE, 2016, 10 (06) : 1082 - 1102
  • [4] An individualized preprocessing for medical data classification
    AlMuhaideb, Sarab
    Menai, Mohamed El Bachir
    4TH SYMPOSIUM ON DATA MINING APPLICATIONS (SDMA2016), 2016, 82 : 35 - 42
  • [5] The impact of preprocessing on text classification
    Uysal, Alper Kursat
    Gunal, Serkan
    INFORMATION PROCESSING & MANAGEMENT, 2014, 50 (01) : 104 - 112
  • [6] Classification and Preprocessing in the Stock Data
    Juszczuk, Przemyslaw
    Kozak, Jan
    BUSINESS INFORMATION SYSTEMS WORKSHOPS, BIS 2017, 2017, 303 : 269 - 281
  • [7] Improving medical/biological data classification performance by wavelet preprocessing
    Li, Q
    Li, T
    Zhu, SH
    Kambhamettu, C
    2002 IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2002, : 657 - 660
  • [8] Impact of Boolean factorization as preprocessing methods for classification of Boolean data
    Radim Belohlavek
    Jan Outrata
    Martin Trnecka
    Annals of Mathematics and Artificial Intelligence, 2014, 72 : 3 - 22
  • [9] Impact of Boolean factorization as preprocessing methods for classification of Boolean data
    Belohlavek, Radim
    Outrata, Jan
    Trnecka, Martin
    ANNALS OF MATHEMATICS AND ARTIFICIAL INTELLIGENCE, 2014, 72 (1-2) : 3 - 22
  • [10] MC: a Unsupervised Data Preprocessing for Classification
    Hu, Enliang
    Chen, Songcan
    Yin, Xuesong
    2008 INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION TECHNOLOGY APPLICATION, VOL I, PROCEEDINGS, 2008, : 259 - 263