Spatiotemporal data partitioning for distributed random forest algorithm: Air quality prediction using imbalanced big spatiotemporal data on Spark distributed framework

Cited by: 16
Authors
Asgari, Marjan [1]
Yang, Wanhong [1]
Farnaghi, Mahdi [2]
Affiliations
[1] Univ Guelph, Dept Geog Environm & Geomat, Guelph, ON, Canada
[2] Univ Twente, Fac Geoinformat Sci & Earth Observat, Twente, Netherlands
Keywords
Big spatiotemporal data; Distributed systems; Air quality prediction; Distributed random forest algorithm; Imbalanced data; Apache Spark; Pollution; Classification
DOI
10.1016/j.eti.2022.102776
Chinese Library Classification
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
Spatiotemporal air quality datasets are typically collected hourly at monitoring stations deployed non-uniformly across a metropolitan city. These datasets are not only big, which challenges the storage and processing capacity of centralized computing systems, but also imbalanced and spatially heterogeneous, which may result in biased air quality prediction. To address these challenges, we designed and developed a parallel air quality prediction system that combines a spatiotemporal data partitioning method, a distributed machine learning algorithm, Hadoop's distributed data storage platform and resource scheduler/manager, and Spark's efficient in-memory execution environment, which is well suited to iterative algorithms such as machine learning. Our proposed spatiotemporal partitioning method accounts for the imbalance and spatial heterogeneity of big air quality data in predictive models while complying with the load-balancing requirement of distributed computing systems. The Distributed Random Forest algorithm from the H2O library on the Spark framework was selected as the distributed machine learning algorithm to develop the air quality predictive model. This algorithm is an ensemble forest with algorithm-level adjustments that let it perform efficiently on big imbalanced datasets. An application of the parallel air quality prediction system to Tehran, Iran, showed that the system achieved a considerable speedup and improved both the overall accuracy and the class precision of air quality prediction when working with imbalanced big spatiotemporal air quality datasets. A future research direction is to add data streaming and visualization functions to the system to provide rapid and reliable air quality prediction for supporting environmental health management. (c) 2022 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
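The abstract does not spell out the partitioning procedure, so the following is only a minimal sketch of one way a partitioner could balance load while preserving class and station proportions. It assumes stratification by (station, class label) followed by round-robin dealing; this is an illustrative stand-in, not the authors' method. The function name `stratified_partition`, the record fields `station`, `hour`, and `label`, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def stratified_partition(records, n_partitions):
    """Deal records to partitions round-robin, one stratum at a time.

    A stratum is a (station, label) pair (an assumption made for this
    sketch). The dealing position carries over between strata, so
    partition sizes differ by at most one record, and each stratum is
    spread across partitions with counts that differ by at most one.
    """
    strata = defaultdict(list)
    for rec in records:
        strata[(rec["station"], rec["label"])].append(rec)

    partitions = [[] for _ in range(n_partitions)]
    slot = 0  # global dealing position, shared by all strata
    for key in sorted(strata):
        for rec in strata[key]:
            partitions[slot % n_partitions].append(rec)
            slot += 1
    return partitions


# Synthetic, imbalanced toy data: three stations with few "unhealthy" hours.
records = []
for station, n_good, n_unhealthy in [("S1", 90, 10), ("S2", 40, 4), ("S3", 20, 8)]:
    records += [{"station": station, "hour": h, "label": "good"}
                for h in range(n_good)]
    records += [{"station": station, "hour": h, "label": "unhealthy"}
                for h in range(n_unhealthy)]

partitions = stratified_partition(records, n_partitions=4)

for i, part in enumerate(partitions):
    labels = Counter(rec["label"] for rec in part)
    print(i, len(part), dict(labels))
```

In a Spark setting, the partition index computed this way could serve as a custom partitioning key, so that each executor receives a similar number of records and a similar share of the minority class. That mapping is an assumption of this sketch and is not described in the abstract.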
Pages: 11