Preprocessing of unstructured medical data: the impact of each preprocessing stage on classification

Cited by: 5
Authors
Kashina, M. [1]
Lenivtceva, I. D. [1]
Kopanitsa, G. D. [1]
Affiliation
[1] ITMO Univ, St Petersburg, Russia
Source
9TH INTERNATIONAL YOUNG SCIENTISTS CONFERENCE IN COMPUTATIONAL SCIENCE, YSC2020 | 2020 / Vol. 178
Keywords
preprocessing; tokenization; classifier; medical text; natural language processing; allergy;
DOI
10.1016/j.procs.2020.11.030
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Developing methods for processing data in Russian, medical texts in particular, remains important. In this paper, we examine how each stage of text preprocessing affects classifier results. We analyzed 269,923 patient allergy anamnesis records, 11,670 of which were selected for further processing. We consider the main stages of preprocessing: tokenization, stop-word removal, error correction, document cropping, normalization, class harmonization, and vectorization. We used the Bag-of-Words model to vectorize the data. Logistic regression was chosen for classification because it is easy to reproduce and interpret. Precision, recall, and F-measure were selected as evaluation metrics. The results (F = 88.12%) showed that normalization and error correction were the most effective stages. © 2020 The Authors. Published by Elsevier B.V.
Pages: 284-290
Number of pages: 7
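
The pipeline named in the abstract (tokenization, stop-word removal, Bag-of-Words vectorization, logistic regression, and precision/recall/F-measure) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy English records, the stop-word list, and every function name below are hypothetical, the learning rate and epoch count are arbitrary, and the error-correction, cropping, normalization, and class-harmonization stages are omitted. The classifier is a plain batch gradient descent written with the standard library only, and the metrics are computed on the training records purely as a smoke test.

```python
import math
import re

# Hypothetical toy records (1 = allergy noted, 0 = no allergy noted).
# The paper works with Russian anamnesis records; these are illustrative only.
RECORDS = [
    ("Allergic reaction to penicillin, rash", 1),
    ("Allergy to pollen and dust", 1),
    ("Allergic to nuts, swelling of the lips", 1),
    ("No allergies reported", 0),
    ("Denies any drug reactions", 0),
    ("No known drug intolerance", 0),
]
STOP_WORDS = {"a", "and", "any", "of", "the", "to"}  # assumed list


def preprocess(text):
    """Lowercase, tokenize on runs of letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each distinct token to a column index."""
    distinct = sorted({tok for toks in token_lists for tok in toks})
    return {tok: i for i, tok in enumerate(distinct)}


def vectorize(tokens, vocab):
    """Bag-of-Words: count occurrences of each vocabulary token."""
    vec = [0.0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def score(x, w, b):
    """Predicted probability of the positive class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression with batch gradient descent on the log loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        errors = [score(x, w, b) - yi for x, yi in zip(X, y)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, X)) / n
        b -= lr * sum(errors) / n
    return w, b


def predict(x, w, b):
    """Threshold the predicted probability at 0.5."""
    return 1 if score(x, w, b) >= 0.5 else 0


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F-measure for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


tokens = [preprocess(text) for text, _ in RECORDS]
labels = [label for _, label in RECORDS]
vocab = build_vocab(tokens)
X = [vectorize(toks, vocab) for toks in tokens]
w, b = train_logreg(X, labels)
y_pred = [predict(x, w, b) for x in X]
precision, recall, f1 = precision_recall_f1(labels, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

To measure the effect of a single stage in the spirit of the paper, one would rerun the same script with that stage switched on or off (for example, with an empty STOP_WORDS set) and compare the F-measure on held-out records rather than on the training set.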