Evaluating the Open Source Data Containers for Handling Big Geospatial Raster Data

Cited by: 14

Authors:

Hu, Fei [1,2]

Xu, Mengchao [1,2]

Yang, Jingchao [1,2]

Liang, Yanshou [1,2]

Cui, Kejin [1,2]

Little, Michael M. [3]

Lynnes, Christopher S. [3]

Duffy, Daniel Q. [3]

Yang, Chaowei [1,2]

Affiliations:

[1] George Mason University, NSF Spatiotemporal Innovation Center, Fairfax, VA 22030, USA

[2] George Mason University, Department of Geography and GeoInformation Science, Fairfax, VA 22030, USA

[3] NASA Goddard Space Flight Center, Greenbelt, MD 20771, USA

Source:

ISPRS International Journal of Geo-Information | 2018, Vol. 7, Issue 4

Funding:

U.S. National Science Foundation

Keywords:

big data; data container; geospatial raster data management; GIS; system; performance

DOI:

10.3390/ijgi7040144

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Big geospatial raster data pose a grand challenge to data management technologies for effective big data query and processing. To address these challenges, various big data container solutions have been developed or enhanced to facilitate data storage, retrieval, and analysis. Data containers were also developed or enhanced to handle geospatial data. For example, Rasdaman was developed to handle raster data, and GeoSpark/SpatialHadoop were enhanced from Spark/Hadoop to handle vector data. However, few studies systematically compare and evaluate the features and performance of these popular data containers. This paper provides a comprehensive evaluation of six popular data containers (i.e., Rasdaman, SciDB, Spark, ClimateSpark, Hive, and MongoDB) for handling multi-dimensional, array-based geospatial raster datasets. Their architectures, technologies, capabilities, and performance are compared and evaluated from two perspectives: (a) system design and architecture (distributed architecture, logical data model, physical data model, and data operations); and (b) practical use experience and performance (data preprocessing, data uploading, query speed, and resource consumption). Four major conclusions are offered: (1) no data container except ClimateSpark has good support for the HDF data format used in this paper, so the others require time- and resource-consuming data preprocessing to load data; (2) SciDB, Rasdaman, and MongoDB handle queries on small and medium volumes of data well, whereas Spark and ClimateSpark can handle large volumes of data with stable resource consumption; (3) SciDB and Rasdaman provide mature array-based data operations and analytical functions, while the others lack these functions; and (4) SciDB, Spark, and Hive have better support for user-defined functions (UDFs) to extend system capability.

Pages: 22
