ClimateSpark: An in-memory distributed computing framework for big climate data analytics

Cited by: 28
Authors
Hu, Fei [1]
Yang, Chaowei [1]
Schnase, John L. [2]
Duffy, Daniel Q. [2]
Xu, Mengchao [1]
Bowen, Michael K. [2]
Lee, Tsengdar [3]
Song, Weiwei [1]
Affiliations
[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA
[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
[3] NASA Headquarters, Washington, DC USA
Keywords
Big data; High performance computing; Array-based data model; Climate data analytics; Apache spark; Geospatial cyberinfrastructure; Cloud computing; CLOUD; CHALLENGES;
DOI
10.1016/j.cageo.2018.03.011
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
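The role of the spatiotemporal chunk index described in the abstract can be illustrated with a minimal sketch. It is not ClimateSpark's implementation: the `ChunkEntry` record, its fields, the file names, and the `select_chunks` helper are all hypothetical, and a linear scan stands in for whatever index structure the framework actually uses. The point is only that a query's time/latitude/longitude bounding box is compared with each chunk's bounds, so chunks that cannot contribute are never read.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[float, float]  # inclusive (min, max)


@dataclass(frozen=True)
class ChunkEntry:
    """Hypothetical index record for one stored chunk of a climate variable."""
    path: str    # file that holds the chunk (made-up names below)
    offset: int  # byte offset of the chunk inside that file
    time: Range  # e.g. days since some epoch
    lat: Range   # degrees north
    lon: Range   # degrees east


def overlaps(a: Range, b: Range) -> bool:
    """True when two inclusive ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_chunks(index: List[ChunkEntry], time: Range, lat: Range, lon: Range) -> List[ChunkEntry]:
    """Return only the chunks whose bounding box intersects the query box."""
    return [
        c for c in index
        if overlaps(c.time, time) and overlaps(c.lat, lat) and overlaps(c.lon, lon)
    ]


# Illustrative index with three chunks; the values are invented for the example.
index = [
    ChunkEntry("a.nc", 0, (0, 30), (-90, 0), (-180, 0)),
    ChunkEntry("a.nc", 4096, (0, 30), (0, 90), (-180, 0)),
    ChunkEntry("b.nc", 0, (31, 58), (0, 90), (0, 180)),
]

# Query a box in the northern/western quadrant during the first time window.
hits = select_chunks(index, time=(10, 20), lat=(10, 40), lon=(-100, -80))
```

In this sketch only the selected chunks would then be handed to parallel readers; a production index would more plausibly use a tree or grid structure than a list scan.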
Pages: 154-166
Page count: 13