Kaleido: Enabling Efficient Scientific Data Processing on Big-Data Systems

Cited by: 0

Authors:

Biookaghazadeh, Saman [1]

Zhou, Shujia [2]

Zhao, Ming [1]

Affiliations:

[1] Arizona State Univ, Tempe, AZ 85281 USA

[2] Northrop Grumman, Baltimore, MD USA

Source:

2017 INTERNATIONAL CONFERENCE ON NETWORKING, ARCHITECTURE, AND STORAGE (NAS) | 2017

Funding:

U.S. National Science Foundation;

Keywords:

MAPREDUCE;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

Big-data systems are increasingly important for solving data-driven problems in many science domains. However, existing big-data systems cannot efficiently process self-describing data formats such as NetCDF, which scientific communities commonly use for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems in science domains. This paper presents Kaleido, which addresses this problem by enabling big-data systems to efficiently store and process scientific data. Specifically, it enables Hadoop to store NetCDF data directly on HDFS and process it in MapReduce using convenient APIs. It also enables Hive to support queries on NetCDF data, transparently to users. Moreover, it employs optimizations tailored to scientific data, particularly dimension-aware layouts, which allow efficient execution of subset queries targeting any dimension of a multi-dimensional dataset. The paper presents a comprehensive evaluation of Kaleido using representative queries on a typical geoscience dataset. The results show that Kaleido achieves substantial speedup and space savings compared to existing solutions for storing and processing NetCDF data on Hadoop, and that it also substantially outperforms state-of-the-art solutions for supporting subset queries on scientific data.
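
The abstract's point about dimension-aware layouts can be illustrated with a small sketch. This is not Kaleido's implementation: the dataset shape, the one-chunk-per-index layout, and the function names (`cells_read`, `best_layout`) are all assumptions made for illustration. It only shows why a subset query is cheaper when the data is chunked along the dimension being subset.

```python
# Hypothetical 3-D dataset shape (time, lat, lon); not taken from the paper.
SHAPE = {"time": 8, "lat": 4, "lon": 4}
DIMS = list(SHAPE)


def chunks_touched(layout_dim, query):
    """Count chunks read when the data holds one chunk per index of layout_dim.

    query maps a dimension name to a half-open (start, stop) index range;
    dimensions absent from the query are read in full.
    """
    start, stop = query.get(layout_dim, (0, SHAPE[layout_dim]))
    return stop - start


def cells_read(layout_dim, query):
    """Cells scanned, assuming every touched chunk is read whole."""
    chunk_size = 1
    for dim in DIMS:
        if dim != layout_dim:
            chunk_size *= SHAPE[dim]
    return chunks_touched(layout_dim, query) * chunk_size


def best_layout(query):
    """Pick the layout dimension that scans the fewest cells for this query."""
    return min(DIMS, key=lambda dim: cells_read(dim, query))


if __name__ == "__main__":
    query = {"lat": (1, 2)}  # subset on a single latitude index
    for dim in DIMS:
        print(dim, cells_read(dim, query))
    print("best layout:", best_layout(query))
```

Under these assumptions, a query that subsets only latitude scans a quarter as many cells with a latitude-chunked layout as with a time-chunked one, which is the kind of gap a layout chosen per dimension is meant to close.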

Pages: 121-130

Page count: 10
