Enabling Scientific Data Storage and Processing on Big-data Systems

Authors
Biookaghazadeh, Saman [1]
Xu, Yiqi [2]
Zhou, Shujia [3]
Zhao, Ming [1]
Affiliations
[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA
[2] Florida Int Univ, Sch Comp & Informat Sci, Miami, FL USA
[3] Northrop Grumman Informat Technol, Colorado Springs, CO USA
Keywords
Scientific data; big data; NetCDF; Hadoop
DOI
Not available
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot support self-describing data formats such as NetCDF, which scientific communities commonly use for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems in science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper presents a solution to this problem by enabling big-data systems to store and process scientific data directly. Specifically, it enables Hadoop to store NetCDF data efficiently on HDFS and to process them in MapReduce using convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper also evaluates the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves substantial speedup (up to 20 times) and space savings (an 83% reduction) compared to the traditional approach, which must convert NetCDF data to CSV format before Hadoop and Hive can use them.
Pages: 1978-1984
Number of pages: 7