Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems

Authors
Biookaghazadeh, Saman [1]
Zhou, Shujia [2]
Zhao, Ming [1]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85281 USA
[2] Northrop Grumman, Baltimore, MD USA
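Funding
National Science Foundation (USA);
Keywords
MAPREDUCE;
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract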
Big-data systems are increasingly important for solving data-driven problems in many science domains. However, existing big-data systems cannot efficiently process self-describing data formats such as NetCDF, which scientific communities commonly use for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems in science domains. This paper presents Kaleido, a solution that enables big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to store NetCDF data directly on HDFS and to process it in MapReduce using convenient APIs. It also enables Hive to support queries on NetCDF data, transparently to users. Moreover, it employs optimizations tailored to scientific data, particularly dimension-aware layouts, which allow efficient execution of subset queries targeting any dimension of a multi-dimensional dataset. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscience dataset. The results show that Kaleido achieves substantial speedup and space savings compared to existing solutions for storing and processing NetCDF data on Hadoop, and that it also substantially outperforms state-of-the-art solutions for supporting subset queries on scientific data.
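The abstract does not describe how the dimension-aware layouts are implemented, so the following is only a minimal sketch of the general idea behind them, not Kaleido's method: the order in which dimensions are nested on disk determines whether a subset query along one dimension touches one contiguous range or many scattered ones. The function names, the tiny (time, lat, lon) shape, and the row-major storage model are all assumptions made for illustration.

```python
from itertools import product


def flat_offsets(shape, order, fixed_dim, fixed_index):
    """Return sorted flat offsets of every cell whose index along
    `fixed_dim` equals `fixed_index`, assuming the array is stored
    row-major with dimensions nested as in `order`
    (order[0] varies slowest, order[-1] varies fastest)."""
    # Compute the stride of each dimension under this storage order.
    strides = {}
    stride = 1
    for dim in reversed(order):
        strides[dim] = stride
        stride *= shape[dim]
    # Enumerate all cells in the subset: one fixed index, full range elsewhere.
    ranges = [
        [fixed_index] if d == fixed_dim else range(shape[d])
        for d in range(len(shape))
    ]
    return sorted(
        sum(i * strides[d] for d, i in enumerate(idx))
        for idx in product(*ranges)
    )


def count_runs(offsets):
    """Count contiguous runs in a sorted list of offsets.
    Each run corresponds to one sequential read."""
    runs = 0
    prev = None
    for off in offsets:
        if prev is None or off != prev + 1:
            runs += 1
        prev = off
    return runs


# Hypothetical dataset: dimension 0 = time (4), 1 = lat (3), 2 = lon (5).
shape = (4, 3, 5)

# Subset query: all cells with lon == 2.
lon_innermost = count_runs(flat_offsets(shape, (0, 1, 2), 2, 2))
lon_outermost = count_runs(flat_offsets(shape, (2, 0, 1), 2, 2))
print("runs with lon innermost:", lon_innermost)
print("runs with lon outermost:", lon_outermost)
```

In this toy model, the same query needs one sequential read when the queried dimension is outermost and one read per cell when it is innermost, which is why a single fixed layout cannot serve subset queries on every dimension equally well.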
Pages: 121-130
Page count: 10