EVCA Classifier: A MCMC-Based Classifier for Analyzing High-Dimensional Big Data

Cited by: 4
Authors
Vlachou, Eleni [1]
Karras, Christos [1]
Karras, Aristeidis [1]
Tsolis, Dimitrios [2]
Sioutas, Spyros [1]
Affiliations
[1] Univ Patras, Comp Engn & Informat Dept, Patras 26504, Greece
[2] Univ Patras, Dept Hist & Archaeol, Patras 26504, Greece
Keywords
stochastic data engineering; Markov Chain Monte Carlo; big data management; Apache Spark; Bayesian inference; Bayesian ML; high-dimensional data; environment data analysis; CHAIN MONTE-CARLO; BAYESIAN-INFERENCE
DOI
10.3390/info14080451
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In this work, we introduce an innovative Markov Chain Monte Carlo (MCMC) classifier, a synergistic combination of Bayesian machine learning and Apache Spark, highlighting the novel use of this methodology in the spectrum of big data management and environmental analysis. By employing a large dataset of air pollutant concentrations in Madrid from 2001 to 2018, we developed a Bayesian Logistic Regression model capable of accurately classifying the Air Quality Index (AQI) as safe or hazardous. This mathematical formulation adeptly synthesizes prior beliefs and observed data into robust posterior distributions, enabling superior management of overfitting, enhancing the predictive accuracy, and demonstrating a scalable approach for large-scale data processing. Notably, the proposed model achieved a maximum accuracy of 87.91% and an exceptional recall value of 99.58% at a decision threshold of 0.505, reflecting its proficiency in accurately identifying true negatives and mitigating misclassification, even though it slightly underperformed in comparison to the traditional Frequentist Logistic Regression in terms of accuracy and the AUC score. Ultimately, this research underscores the efficacy of Bayesian machine learning for big data management and environmental analysis, while signifying the pivotal role of the first-ever MCMC Classifier and Apache Spark in dealing with the challenges posed by large datasets and high-dimensional data, with broader implications not only in fields such as statistics, mathematics, and physics but also in practical, real-world applications.
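As a rough illustration of the idea summarized in the abstract, the following is a minimal sketch of Bayesian logistic regression fitted with a random-walk Metropolis sampler and evaluated at the 0.505 decision threshold the abstract mentions. It is not the authors' EVCA implementation: it uses plain NumPy rather than Apache Spark, synthetic data rather than the Madrid air quality dataset, and an assumed Gaussian prior. All function names, the prior scale, the step size, and the sample counts are hypothetical choices made only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_posterior(w, X, y, prior_sd=10.0):
    """Unnormalized log posterior: Bernoulli likelihood plus an assumed N(0, prior_sd^2) prior."""
    z = X @ w
    # Stable form of sum(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(z)
    log_lik = np.sum(y * z - np.logaddexp(0.0, z))
    log_prior = -0.5 * np.sum(w ** 2) / prior_sd ** 2
    return log_lik + log_prior

def metropolis(X, y, n_samples=4000, step=0.1, seed=0):
    """Random-walk Metropolis over the regression weights."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    current = log_posterior(w, X, y)
    samples = np.empty((n_samples, X.shape[1]))
    for i in range(n_samples):
        proposal = w + step * rng.standard_normal(w.shape)
        candidate = log_posterior(proposal, X, y)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < candidate - current:
            w, current = proposal, candidate
        samples[i] = w
    return samples

def predict_proba(samples, X, burn_in=1000):
    """Posterior predictive probability: average sigmoid over retained samples."""
    return sigmoid(X @ samples[burn_in:].T).mean(axis=1)

def accuracy_and_recall(y_true, proba, threshold=0.505):
    """Classify at the given threshold and report accuracy and recall."""
    y_pred = (proba >= threshold).astype(int)
    accuracy = np.mean(y_pred == y_true)
    positives = y_true == 1
    recall = np.mean(y_pred[positives] == 1) if positives.any() else float("nan")
    return accuracy, recall

# Synthetic stand-in data: an intercept column plus two features (assumption).
data_rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), data_rng.standard_normal((500, 2))])
true_w = np.array([0.5, 2.0, -1.0])
y = (data_rng.uniform(size=500) < sigmoid(X @ true_w)).astype(int)

samples = metropolis(X, y)
proba = predict_proba(samples, X)
acc, rec = accuracy_and_recall(y, proba)
print(f"accuracy={acc:.3f} recall={rec:.3f}")
```

Averaging the sigmoid over posterior samples, rather than plugging in a single point estimate, is what makes the prediction Bayesian in this sketch; the threshold then converts that averaged probability into a safe or hazardous label.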
Pages: 27