ClimateSpark: An in-memory distributed computing framework for big climate data analytics

Cited by: 28

Authors:

Hu, Fei [1]

Yang, Chaowei [1]

Schnase, John L. [2]

Duffy, Daniel Q. [2]

Xu, Mengchao [1]

Bowen, Michael K. [2]

Lee, Tsengdar [3]

Song, Weiwei [1]

Affiliations:

[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA

[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA

[3] NASA Headquarters, Washington, DC USA

Source:

COMPUTERS & GEOSCIENCES | 2018 / Vol. 115

Keywords:

Big data; High performance computing; Array-based data model; Climate data analytics; Apache Spark; Geospatial cyberinfrastructure; Cloud computing; CLOUD; CHALLENGES

DOI:

10.1016/j.cageo.2018.03.011

Chinese Library Classification (CLC) number:

TP39 [Computer applications]

Discipline classification codes:

081203; 0835

Abstract:

The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
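
The role of the spatiotemporal chunk index described in the abstract can be illustrated with a minimal Python sketch. This is not ClimateSpark's implementation: the `ChunkEntry` record, its field names, and the `select_chunks` helper are hypothetical, and a plain linear scan stands in for whatever index structure the paper actually uses. The sketch only shows the principle that per-chunk bounding boxes allow a query to skip chunks it does not intersect before any data is read.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[float, float]  # inclusive (low, high) interval


@dataclass(frozen=True)
class ChunkEntry:
    """One index record per stored chunk. All field names are hypothetical."""
    path: str        # file that holds the chunk
    variable: str    # climate variable name
    time: Range      # time steps covered by the chunk
    lat: Range       # latitude extent in degrees
    lon: Range       # longitude extent in degrees


def overlaps(a: Range, b: Range) -> bool:
    """True when two inclusive intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_chunks(index: List[ChunkEntry], variable: str,
                  time: Range, lat: Range, lon: Range) -> List[ChunkEntry]:
    """Keep only chunks that intersect the query box; the rest need not be read."""
    return [c for c in index
            if c.variable == variable
            and overlaps(c.time, time)
            and overlaps(c.lat, lat)
            and overlaps(c.lon, lon)]
```

In a distributed setting, the selected entries would be the only chunks handed to workers for loading, which is one plausible way an index of this kind reduces unnecessary I/O.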


Pages: 154-166

Page count: 13

50 records in total

[1] Distributed In-Memory Analytics for Big Temporal Data
Yao, Bin
Zhang, Wei
Wang, Zhi-Jie
Chen, Zhongpu
Shang, Shuo
Zheng, Kai
Guo, Minyi
[J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, DASFAA 2018, PT I, 2018, 10827 : 549 - 565
[2] Exploration of In-Memory Computing for Big Data Analytics using Queuing Theory
Srivastava, Riktesh
[J]. 2018 2ND INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPILATION, COMPUTING AND COMMUNICATIONS (HP3C 2018), 2018, : 11 - 16
[3] In-Memory Computing for Scalable Data Analytics
Li, Jun
[J]. 2015 IEEE INTERNATIONAL CONFERENCE ON CLOUD ENGINEERING (IC2E 2015), 2015, : 93 - 94
[4] Design and implementation of reconfigurable acceleration for in-memory distributed big data computing
Hou, Junjie
Zhu, Yongxin
Du, Sen
Song, Shijin
[J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2019, 92 : 68 - 75
[5] Optimizing Performance and Computing Resource Management of in-memory Big Data Analytics with Disaggregated Persistent Memory
Chen, Shouwei
Wang, Wensheng
Wu, Xueyang
Fan, Zhen
Huang, Kunwu
Zhuang, Peiyu
Li, Yue
Rodero, Ivan
Parashar, Manish
Weng, Dennis
[J]. 2019 19TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND GRID COMPUTING (CCGRID), 2019, : 21 - 30
[6] An In-Memory based Framework for Scientific Data Analytics
Elia, Donatello
Fiore, Sandro
D'Anca, Alessandro
Palazzo, Cosimo
Foster, Ian
Williams, Dean N.
[J]. PROCEEDINGS OF THE ACM INTERNATIONAL CONFERENCE ON COMPUTING FRONTIERS (CF'16), 2016, : 424 - 429
[7] Using In-Memory Analytics to Quickly Crunch Big Data
Garber, Lee
[J]. COMPUTER, 2012, 45 (10) : 16 - 18
[8] Distributed Big Data Analytics in Service Computing
Yu, Weider D.
Gottumukkala, AvinashChander
Senthailselvi, Deenash Arivazhagan
Maniraj, Prabhu
Khonde, Tushar
[J]. 2017 IEEE 13TH INTERNATIONAL SYMPOSIUM ON AUTONOMOUS DECENTRALIZED SYSTEMS (ISADS 2017), 2017, : 55 - 60
[9] Towards Automatic Memory Tuning for In-Memory Big Data Analytics in Clusters
Koliopoulos, Aris-Kyriakos
Yiapanis, Paraskevas
Tekiner, Firat
Nenadic, Goran
Keane, John
[J]. 2016 IEEE INTERNATIONAL CONGRESS ON BIG DATA - BIGDATA CONGRESS 2016, 2016, : 353 - 356
[10] Survey of In-memory Big Data Analytics and Latest Research Opportunities
Gangarde, Rupali
Pawar, Ambika
Dani, Ajay
[J]. 2016 FOURTH INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND GRID COMPUTING (PDGC), 2016, : 197 - 201
