Enabling Scientific Data Storage and Processing on Big-data Systems

Authors
Biookaghazadeh, Saman [1]
Xu, Yiqi [2]
Zhou, Shujia [3]
Zhao, Ming [1]
Affiliations
[1] Arizona State Univ, Sch Comp Informat & Decis Syst Engn, Tempe, AZ 85287 USA
[2] Florida Int Univ, Sch Comp & Informat Sci, Miami, FL USA
[3] Northrop Grumman Informat Technol, Colorado Springs, CO USA
Keywords
Scientific data; big data; NetCDF; Hadoop
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Big-data systems are increasingly important for solving data-driven problems in many science domains, including the geosciences. However, existing big-data systems cannot support self-describing data formats such as NetCDF, which scientific communities commonly use for data distribution and sharing. This limitation is a serious hurdle to the further adoption of big-data systems in science domains and prevents scientific users from leveraging these systems to improve their productivity. This paper addresses the problem by enabling big-data systems to store and process scientific data directly. Specifically, it enables Hadoop to store NetCDF data efficiently on HDFS and to process it in MapReduce through convenient APIs. It also enables Hive to support standard queries on NetCDF data, transparently to users. The paper evaluates the proposed solution using several representative queries on a typical geoscientific dataset. The results show that the proposed approach achieves a substantial speedup (up to 20 times) and space saving (83% reduction) compared to the traditional approach, which must convert NetCDF data to CSV format before Hadoop and Hive can use it.
Pages: 1978 - 1984
Number of pages: 7
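
The abstract compares direct NetCDF storage against the traditional route of flattening the data to CSV. The following is a minimal sketch, not the paper's implementation, of why that flattening tends to inflate storage: an array format keeps each coordinate axis once, while a CSV row repeats the coordinates for every cell. The grid dimensions, values, and all function names here are hypothetical, and the dense layout is a plain packed array standing in for a real NetCDF file.

```python
import csv
import io
import itertools
import struct


def make_grid(n_time, n_lat, n_lon):
    """Build a small synthetic (time, lat, lon) grid; all values are made up."""
    times = list(range(n_time))
    lats = [-90.0 + i * 180.0 / max(n_lat - 1, 1) for i in range(n_lat)]
    lons = [i * 360.0 / n_lon for i in range(n_lon)]
    values = [280.0 + 0.01 * i for i in range(n_time * n_lat * n_lon)]
    return times, lats, lons, values


def dense_binary_size(times, lats, lons, values):
    """Bytes used when each axis is stored once and values are packed float32."""
    axes = struct.pack(f"<{len(times)}i", *times)
    axes += struct.pack(f"<{len(lats)}f", *lats)
    axes += struct.pack(f"<{len(lons)}f", *lons)
    data = struct.pack(f"<{len(values)}f", *values)
    return len(axes) + len(data)


def flattened_csv(times, lats, lons, values):
    """Flatten the grid to CSV text: one row per cell, coordinates repeated."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["time", "lat", "lon", "value"])
    # product() varies lon fastest, matching the row-major order of `values`.
    cells = itertools.product(times, lats, lons)
    for (t, lat, lon), v in zip(cells, values):
        writer.writerow([t, f"{lat:.4f}", f"{lon:.4f}", f"{v:.4f}"])
    return buf.getvalue()


if __name__ == "__main__":
    grid = make_grid(n_time=4, n_lat=18, n_lon=36)
    binary_bytes = dense_binary_size(*grid)
    csv_bytes = len(flattened_csv(*grid).encode("utf-8"))
    print(f"dense array: {binary_bytes} bytes, flattened CSV: {csv_bytes} bytes")
```

The sizes this prints depend on the chosen grid and text precision, so they should not be read as reproducing the 83% figure reported in the abstract.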