ClimateSpark: An in-memory distributed computing framework for big climate data analytics

Cited by: 28
Authors
Hu, Fei [1]
Yang, Chaowei [1]
Schnase, John L. [2]
Duffy, Daniel Q. [2]
Xu, Mengchao [1]
Bowen, Michael K. [2]
Lee, Tsengdar [3]
Song, Weiwei [1]
Affiliations
[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA
[2] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
[3] NASA Headquarters, Washington, DC USA
Keywords
Big data; High performance computing; Array-based data model; Climate data analytics; Apache Spark; Geospatial cyberinfrastructure; Cloud computing; CLOUD; CHALLENGES;
DOI
10.1016/j.cageo.2018.03.011
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. Chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multiple-dimensional, array-based datasets in various geoscience domains.
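As a rough illustration of the chunk-index idea mentioned in the abstract, the Python sketch below shows how a spatiotemporal index over chunk extents could be used to skip chunks that a query does not touch, before any data is read. It is a minimal sketch under stated assumptions, not ClimateSpark's implementation: `ChunkMeta`, its fields, `overlaps`, and `select_chunks` are hypothetical names, and the sample file path, offsets, and extents are made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[float, float]  # inclusive (low, high)


@dataclass(frozen=True)
class ChunkMeta:
    """Hypothetical index entry: where a chunk is stored and what it covers."""
    path: str             # file holding the chunk (made-up value below)
    offset: int           # byte offset of the chunk within the file
    time_range: Interval  # inclusive range of time steps
    lat_range: Interval   # inclusive latitude range in degrees
    lon_range: Interval   # inclusive longitude range in degrees


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True when two inclusive intervals intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_chunks(index: List[ChunkMeta], time_range: Interval,
                  lat_range: Interval, lon_range: Interval) -> List[ChunkMeta]:
    """Keep only chunks whose extent intersects the query box on every axis."""
    return [
        c for c in index
        if overlaps(c.time_range, time_range)
        and overlaps(c.lat_range, lat_range)
        and overlaps(c.lon_range, lon_range)
    ]


# Made-up index with three chunks from one hypothetical file.
index = [
    ChunkMeta("example.nc", 0, (0, 11), (-90.0, 0.0), (-180.0, 0.0)),
    ChunkMeta("example.nc", 4096, (0, 11), (0.0, 90.0), (0.0, 180.0)),
    ChunkMeta("example.nc", 8192, (12, 23), (0.0, 90.0), (0.0, 180.0)),
]

# Query: time steps 6..8 over a box in the north-eastern quadrant.
hits = select_chunks(index, (6, 8), (30.0, 50.0), (60.0, 120.0))
```

A linear scan is used here for clarity; a real index would likely use a tree or grid structure so that lookups do not touch every entry.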
Pages: 154 - 166
Page count: 13
Related papers
50 in total
  • [41] Distributed In-Memory Computing on Binary RRAM Crossbar
    Ni, Leibin
    Huang, Hantao
    Liu, Zichuan
    Joshi, Rajiv V.
    Yu, Hao
    [J]. ACM JOURNAL ON EMERGING TECHNOLOGIES IN COMPUTING SYSTEMS, 2017, 13 (03)
  • [42] YinMem: a distributed parallel indexed in-memory computation system for large scale data analytics
    Huang, Yin
    Yesha, Yelena
    Halem, Milton
    Yesha, Yaacov
    Zhou, Shujia
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 214 - 222
  • [43] Lightweight distributed computing framework for orchestrating high performance computing and big data
    Ince, Muhammed Numan
    Gunay, Melih
    Ledet, Joseph
    [J]. TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2022, 30 (04) : 1571 - 1585
  • [44] Nomadic Computing for Big Data Analytics
    Yu, Hsiang-Fu
    Hsieh, Cho-Jui
    Yun, Hyokun
    Vishwanathan, S. V. N.
    Dhillon, Inderjit
    [J]. COMPUTER, 2016, 49 (04) : 52 - 60
  • [45] Big Data Analytics for Sustainable Computing
    Anandakumar, H.
    Arulmurugan, R.
    Onn, Chow Chee
    [J]. MOBILE NETWORKS & APPLICATIONS, 2019, 24 (06) : 1751 - 1754
  • [46] Big Data Analytics for Sustainable Computing
    H. Anandakumar
    R. Arulmurugan
    Chow Chee Onn
    [J]. Mobile Networks and Applications, 2019, 24 : 1751 - 1754
  • [47] Distributed In-memory Cluster Computing Approach in Scala for Solving Graph Data Applications
    Johnpaul, C., I
    Thampi, Neetha Susan
    [J]. 2014 INTERNATIONAL CONFERENCE ON ADVANCES IN ELECTRONICS, COMPUTERS AND COMMUNICATIONS (ICAECC), 2014,
  • [48] An algebra for distributed Big Data analytics
    Fegaras, Leonidas
    [J]. JOURNAL OF FUNCTIONAL PROGRAMMING, 2017, 27
  • [49] Distributed Analytics For Big Data: A Survey
    Berloco, Francesco
    Bevilacqua, Vitoantonio
    Colucci, Simona
    [J]. NEUROCOMPUTING, 2024, 574
  • [50] Big IoT Healthcare Data Analytics Framework Based on Fog and Cloud Computing
    Alshammari, Hamoud
    Abd El-Ghany, Sameh
    Shehab, Abdulaziz
    [J]. JOURNAL OF INFORMATION PROCESSING SYSTEMS, 2020, 16 (06) : 1238 - 1249