Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data

Cited by: 14
Authors
Hu, Fei [1, 2]
Xu, Mengchao [1, 2]
Yang, Jingchao [1, 2]
Liang, Yanshou [1, 2]
Cui, Kejin [1, 2]
Little, Michael M. [3]
Lynnes, Christopher S. [3]
Duffy, Daniel Q. [3]
Yang, Chaowei [1, 2]
Affiliations
[1] George Mason Univ, NSF Spatiotemporal Innovat Ctr, Fairfax, VA 22030 USA
[2] George Mason Univ, Dept Geog & GeoInformat Sci, Fairfax, VA 22030 USA
[3] NASA, Goddard Space Flight Ctr, Greenbelt, MD 20771 USA
Source
Funding
National Science Foundation (USA)
Keywords
big data; data container; geospatial raster data management; GIS; SYSTEM; PERFORMANCE;
DOI
10.3390/ijgi7040144
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Big geospatial raster data pose a grand challenge to data management technologies for effective big data query and processing. To address this challenge, various big data container solutions have been developed or enhanced to facilitate data storage, retrieval, and analysis. Data containers were also developed or enhanced to handle geospatial data. For example, Rasdaman was developed to handle raster data, and GeoSpark/SpatialHadoop were enhanced from Spark/Hadoop to handle vector data. However, few studies have systematically compared and evaluated the features and performance of these popular data containers. This paper provides a comprehensive evaluation of six popular data containers (i.e., Rasdaman, SciDB, Spark, ClimateSpark, Hive, and MongoDB) for handling multi-dimensional, array-based geospatial raster datasets. Their architectures, technologies, capabilities, and performance are compared and evaluated from two perspectives: (a) system design and architecture (distributed architecture, logical data model, physical data model, and data operations); and (b) practical use experience and performance (data preprocessing, data uploading, query speed, and resource consumption). Four major conclusions are offered: (1) no data container except ClimateSpark has good support for the HDF data format used in this paper, so time- and resource-consuming data preprocessing is required to load data; (2) SciDB, Rasdaman, and MongoDB handle queries over small and medium data volumes well, whereas Spark and ClimateSpark can handle large volumes of data with stable resource consumption; (3) SciDB and Rasdaman provide mature array-based data operations and analytical functions, while the others lack these functions; and (4) SciDB, Spark, and Hive have better support for user-defined functions (UDFs) to extend the system capability.
Pages: 22