Evaluation of predictive performance of modeling hyperuricemia using medical big data: comparison of data preprocessing methods

Cited by: 0
Authors
Li, Luwei [1, 4]
Huang, Xian [1]
Yan, Cijin [3]
He, Shuzhan [3]
Cheng, Sishuai [4]
Yang, WenJie [2, 5]
Affiliations
[1] Sun Yat Sen Univ, Guangxi Hosp Div, Dept Rheumatol & Immunol, Affiliated Hosp 1, Nanning, Guangxi, Peoples R China
[2] Sun Yat Sen Univ, Guangxi Hosp Div, Affiliated Hosp 1, Dept Hematol, Nanning, Guangxi, Peoples R China
[3] Sun Yat Sen Univ, Guangxi Hosp Div, Affiliated Hosp 1, Dept Endocrinol, Nanning, Guangxi, Peoples R China
[4] Guilin Med Univ, Guilin, Guangxi, Peoples R China
[5] Sun Yat Sen Univ, Guangxi Hosp Div, Affiliated Hosp 1, 3 FoZiLing Rd, Nanning, Guangxi, Peoples R China
Keywords
Medical big data; Hyperuricemia; Data preprocessing; Continuous variables; Categorized variables; Assignment; Modeling
DOI
10.1186/s40537-025-01142-5
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Background: Using medical big data from two large-scale populations, prediction models were built on the continuous variables of the raw data and on the categorical variables obtained after assignment, in order to compare the performance of the two forms of data preprocessing.

Method: A subset of the population data from the physical examination center of Guilin Medical University Affiliated Hospital from 2017 to 2019 was selected as the modeling group (22,124 records in total). Population data from the NHANES database from 1998 to 2018 were selected as the control group (28,021 records in total). Logistic regression, LightGBM, and a deep neural network were first used to predict hyperuricemia from the continuous variables in the raw data. The continuous variables were then assigned values to convert them into categorical variables, and the same algorithms were applied to obtain the predicted values of the two model types. ROC curve analysis, calibration curve analysis, decision curve analysis (DCA), and clinical impact curve (CIC) analysis were performed to comprehensively evaluate the accuracy, discriminatory ability, and clinical practicality of the two model types.

Result: In the logistic regression analysis of the continuous-variable modeling group, after controlling for confounding factors, 11 variables were statistically significant for the incidence of hyperuricemia. After assignment, the logistic regression analysis of the categorical-variable modeling group showed 9 statistically significant variables. In the validation set, the logistic regression analysis of continuous variables showed 8 statistically significant variables, and after assignment the logistic regression analysis of categorical variables showed 10. The AUC values of the ROC curves of the logistic regression, LightGBM, and deep neural network models built on continuous variables were higher than those of the models built on categorical variables. In both the modeling and validation groups, the average deviation between the predicted calibration curve and the standard curve was generally lower for continuous variables than for categorical variables. The DCA and CIC curves of both groups showed that the clinical practicality of the continuous-variable models was higher than that of the categorical-variable models.

Conclusion: In the statistical analysis of hyperuricemia medical big data, directly using the continuous-variable form of the raw data may yield better model performance than using the categorical-variable form after assignment. However, parameters obtained through assignment, such as odds ratios (OR), may be of greater statistical and clinical guidance value.
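The comparison described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the data are synthetic (not the Guilin or NHANES records), there is a single predictor instead of a fitted logistic regression, LightGBM, or neural network model, and the cutoff used for assignment is arbitrary. It only shows how collapsing a continuous predictor into categories can lower the AUC.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def categorize(value, cutoffs):
    """Assign an ordinal category: how many cutoffs the value meets or exceeds."""
    return sum(value >= c for c in cutoffs)


# Synthetic data (hypothetical): one continuous predictor x and a binary
# outcome y that depends on x plus noise.
random.seed(0)
n_samples = 1000
x = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
y = [1 if xi + random.gauss(0.0, 1.0) > 0.5 else 0 for xi in x]

# Discrimination of the raw continuous predictor.
auc_continuous = auc(x, y)

# Discrimination after assignment into two categories at an arbitrary cutoff.
x_categorical = [categorize(xi, [0.0]) for xi in x]
auc_categorical = auc(x_categorical, y)

print(f"AUC, continuous predictor:  {auc_continuous:.3f}")
print(f"AUC, categorized predictor: {auc_categorical:.3f}")
```

With more cutoffs passed to `categorize`, the categorized predictor retains more of the ordering information, so the gap between the two AUC values would be expected to shrink.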
Pages: 17
Related papers
50 in total
  • [41] A Performance Evaluation of Classification Algorithms for Big Data
    Hai, Mo
    Zhang, You
    Zhang, Youjin
    5TH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND QUANTITATIVE MANAGEMENT, ITQM 2017, 2017, 122 : 1100 - 1107
  • [42] A Novel Modeling Approach to Assess the Electricity Consumption of LEED-Certified Research Buildings Using Big Data Predictive Methods
    Chokor, Abbas
    El Asmar, Mounir
    CONSTRUCTION RESEARCH CONGRESS 2016: OLD AND NEW CONSTRUCTION TECHNOLOGIES CONVERGE IN HISTORIC SAN JUAN, 2016, : 1040 - 1049
  • [43] A Novel Approach to Predictive Graphs using Big Data
    Ragavan, Harish
    Shanmugam, Srinivasan
    2016 IEEE 2ND INTERNATIONAL CONFERENCE ON BIG DATA SECURITY ON CLOUD (BIGDATASECURITY), IEEE INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE AND SMART COMPUTING (HPSC), AND IEEE INTERNATIONAL CONFERENCE ON INTELLIGENT DATA AND SECURITY (IDS), 2016, : 123 - 128
  • [44] Predictive Analysis for Diabetes Using Big Data Classification
    Rghioui, Amine
    Oumnad, Abdelmajid
    RECENT ADVANCES IN MATHEMATICS AND TECHNOLOGY, 2020, : 161 - 170
  • [45] A Survey of Preprocessing Methods Used for Analysis of Big Data Originated From Smart Grids
    Alghamdi, Turki Ali
    Javaid, Nadeem
    IEEE ACCESS, 2022, 10 : 29149 - 29171
  • [46] Big Data for Medical Image Analysis: A Performance Study
    Zhang, Rui
    Wang, Hongzhi
    Tewari, Renu
    Schmidt, Gero
    Kakrania, Deepika
    2016 IEEE 30TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS (IPDPSW), 2016, : 1660 - 1664
  • [47] Comparison of TOPSIS, COPRAS, ARAS and WASPAS methods in performance evaluation using crisp and interval data
    Franek, Jiri
    MANAGING AND MODELLING OF FINANCIAL RISKS, 8TH INTERNATIONAL SCIENTIFIC CONFERENCE, PTS I & II, 2016, : 217 - 226
  • [48] Application of Big Data for Medical Data Analysis Using Hadoop Environment
    Roobini, M. S.
    Lakshmi, M.
    INTERNATIONAL CONFERENCE ON INTELLIGENT DATA COMMUNICATION TECHNOLOGIES AND INTERNET OF THINGS, ICICI 2018, 2019, 26 : 1128 - 1135
  • [49] Predictive big data analytic on demonetization data using support vector machine
    Kannan, Nattar
    Sivasubramanian, S.
    Kaliappan, M.
    Vimal, S.
    Suresh, A.
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (Suppl 6): : 14709 - 14720
  • [50] PERFORMANCE DATA - 3 COMPARISON METHODS
    HENRY, GT
    MCMILLAN, JH
    EVALUATION REVIEW, 1993, 17 (06) : 643 - 652