ClimateSpark: An in-memory distributed computing framework for big climate data analytics

Cited by: 28

Authors:

Hu, Fei ^{[1]}

Yang, Chaowei ^{[1]}

Schnase, John L. ^{[2]}

Duffy, Daniel Q. ^{[2]}

Xu, Mengchao ^{[1]}

Bowen, Michael K. ^{[2]}

Lee, Tsengdar ^{[3]}

Song, Weiwei ^{[1]}

Affiliations:

[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA

[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA

[3] NASA Headquarters, Washington, DC USA

Source:

COMPUTERS & GEOSCIENCES | 2018 / Vol. 115

Keywords:

Big data; High performance computing; Array-based data model; Climate data analytics; Apache Spark; Geospatial cyberinfrastructure; Cloud computing; CLOUD; CHALLENGES;

DOI:

10.1016/j.cageo.2018.03.011

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
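The chunk-level spatiotemporal index mentioned in the abstract can be illustrated with a minimal Python sketch. It is not ClimateSpark's implementation: the names `ChunkEntry`, `build_index`, and `select_chunks`, the (time, lat, lon) axis order, and the array and chunk sizes are all assumptions made for illustration. The idea shown is only that recording each chunk's extent lets a query skip chunks that cannot intersect the requested region, so they are never read.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical types for illustration; none of these names come from ClimateSpark.
Range = Tuple[int, int]  # half-open index range [start, stop)


@dataclass(frozen=True)
class ChunkEntry:
    """Index record describing the extent of one chunk of a (time, lat, lon) array."""
    chunk_id: int
    time: Range
    lat: Range
    lon: Range


def overlaps(a: Range, b: Range) -> bool:
    """Return True when two half-open ranges share at least one index."""
    return a[0] < b[1] and b[0] < a[1]


def build_index(shape: Tuple[int, int, int],
                chunk_shape: Tuple[int, int, int]) -> List[ChunkEntry]:
    """Tile an array of `shape` into chunks and record each chunk's extent."""
    entries: List[ChunkEntry] = []
    chunk_id = 0
    for t in range(0, shape[0], chunk_shape[0]):
        for y in range(0, shape[1], chunk_shape[1]):
            for x in range(0, shape[2], chunk_shape[2]):
                entries.append(ChunkEntry(
                    chunk_id,
                    (t, min(t + chunk_shape[0], shape[0])),
                    (y, min(y + chunk_shape[1], shape[1])),
                    (x, min(x + chunk_shape[2], shape[2])),
                ))
                chunk_id += 1
    return entries


def select_chunks(index: List[ChunkEntry],
                  time: Range, lat: Range, lon: Range) -> List[int]:
    """Return ids of chunks whose extent intersects the query box on every axis."""
    return [
        e.chunk_id for e in index
        if overlaps(e.time, time) and overlaps(e.lat, lat) and overlaps(e.lon, lon)
    ]


# Assumed sizes: 24 time steps on a 180 x 360 grid, split into 12 x 90 x 90 chunks.
index = build_index((24, 180, 360), (12, 90, 90))
hits = select_chunks(index, time=(0, 6), lat=(0, 45), lon=(100, 200))
print(f"{len(hits)} of {len(index)} chunks need to be read: {hits}")
```

With these assumed sizes, only the chunks that intersect the query box are selected, and the rest can be skipped without any I/O. A real system would typically also store file paths and byte offsets per chunk so that the selected chunks can be read in parallel.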

Pages: 154-166

Page count: 13
