ClimateSpark: An in-memory distributed computing framework for big climate data analytics

Cited by: 28
Authors
Hu, Fei [1]
Yang, Chaowei [1]
Schnase, John L. [2]
Duffy, Daniel Q. [2]
Xu, Mengchao [1]
Bowen, Michael K. [2]
Lee, Tsengdar [3]
Song, Weiwei [1]
Affiliations
[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA
[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
[3] NASA Headquarters, Washington, DC USA
Keywords
Big data; High performance computing; Array-based data model; Climate data analytics; Apache Spark; Geospatial cyberinfrastructure; Cloud computing; CLOUD; CHALLENGES
DOI
10.1016/j.cageo.2018.03.011
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
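The abstract's point that a spatiotemporal index over chunks avoids unnecessary reads can be illustrated with a minimal sketch. This is not ClimateSpark's implementation: the `ChunkEntry` record, the `select_chunks` helper, the chunk paths, and the `temperature` variable name are all hypothetical, and a linear scan stands in for whatever index structure the framework actually uses.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ChunkEntry:
    """Hypothetical index record: where a chunk is stored and what it covers."""
    path: str
    variable: str
    t_start: int      # first time step covered (inclusive)
    t_end: int        # last time step covered (inclusive)
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def overlaps(lo1: float, hi1: float, lo2: float, hi2: float) -> bool:
    """True when the closed intervals [lo1, hi1] and [lo2, hi2] intersect."""
    return lo1 <= hi2 and lo2 <= hi1


def select_chunks(
    index: List[ChunkEntry],
    variable: str,
    t_range: Tuple[int, int],
    lat_range: Tuple[float, float],
    lon_range: Tuple[float, float],
) -> List[str]:
    """Return paths of only those chunks that intersect the query box.

    Chunks outside the box are never opened, which is the saving the
    index is meant to provide.
    """
    return [
        c.path
        for c in index
        if c.variable == variable
        and overlaps(c.t_start, c.t_end, *t_range)
        and overlaps(c.lat_min, c.lat_max, *lat_range)
        and overlaps(c.lon_min, c.lon_max, *lon_range)
    ]


# Assumed layout: two time blocks times two latitude bands, four chunks total.
INDEX = [
    ChunkEntry("chunk_0", "temperature", 0, 11, -90.0, 0.0, -180.0, 180.0),
    ChunkEntry("chunk_1", "temperature", 0, 11, 0.0, 90.0, -180.0, 180.0),
    ChunkEntry("chunk_2", "temperature", 12, 23, -90.0, 0.0, -180.0, 180.0),
    ChunkEntry("chunk_3", "temperature", 12, 23, 0.0, 90.0, -180.0, 180.0),
]

# Query: time steps 0-5, latitude 10-20, longitude -10 to 10.
hits = select_chunks(INDEX, "temperature", (0, 5), (10.0, 20.0), (-10.0, 10.0))
print(hits)
```

In this toy layout the query touches one of the four chunks, so three reads are skipped; a distributed framework could then schedule work only for the matching chunks.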
Pages: 154-166
Number of pages: 13