In-Memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model

Cited by: 48

Authors:

Huang, Wei [1]

Meng, Lingkui [1]

Zhang, Dongying [1]

Zhang, Wen [1]

Affiliations:

[1] Wuhan Univ, Sch Remote Sensing & Informat Engn, Wuhan 430079, Peoples R China

Source:

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING | 2017, Vol. 10, No. 1

Keywords:

Apache Spark; big data; Hadoop yet another resource negotiator (YARN); parallel processing; remote sensing (RS); FRACTIONAL VEGETATION COVER; SENSING DATA; PERFORMANCE; MODIS; CHALLENGES; MAPREDUCE; FRAMEWORK; SYSTEM; ALBEDO; INDEX;

DOI:

10.1109/JSTARS.2016.2547020

Chinese Library Classification (CLC):

TM [Electrical engineering]; TN [Electronics and communication technology];

Discipline classification codes:

0808 ; 0809 ;

Abstract:

MapReduce has been widely used in Hadoop for parallel processing of large-scale data for the last decade. However, remote-sensing (RS) algorithms based on this programming model are trapped in dense disk I/O operations and unconstrained network communication, and are thus inappropriate for timely processing and analysis of massive, heterogeneous RS data. In this paper, a novel in-memory computing framework called Apache Spark (Spark) is introduced. Because Spark applies transformations to in-memory datasets, these shortcomings are eliminated. To facilitate implementation and assure high performance of Spark-based algorithms in a complex cloud computing environment, a strip-oriented parallel programming model is proposed. By incorporating strips of RS data with resilient distributed datasets (RDDs) of Spark, all-level parallel RS algorithms can be easily expressed with coarse-grained transformation primitives and BitTorrent-enabled broadcast variables. Additionally, a generic image partition method for Spark-based RS algorithms is implemented to efficiently generate differentiable key/value strips from a Hadoop distributed file system (HDFS), concealing the heterogeneity of RS data. Data-intensive multitasking algorithms and iteration-intensive algorithms were evaluated on a Hadoop yet another resource negotiator (YARN) platform. Experiments indicated that the Spark-based parallel algorithms are highly efficient: a multitasking algorithm took less than 4 h to process more than half a terabyte of RS data on a small YARN cluster, and 9 × 9 convolution operations against a 909-MB image took less than 260 s. Further, the efficiency of iteration-intensive algorithms is insensitive to image size.
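The strip-oriented idea in the abstract can be illustrated with a minimal, Spark-free sketch. It is not the authors' implementation: the function names (`make_strips`, `box_filter`, `process_strip`, `run`), the halo-row scheme, and the use of a mean filter in place of a general convolution are all assumptions made for illustration. An image is cut into keyed horizontal strips, each strip is processed independently (the step a Spark `map` over an RDD of key/value strips would perform), and the results are reassembled by key.

```python
def make_strips(image, strip_height, halo):
    """Split image rows into (key, strip) pairs.

    Each strip carries up to `halo` extra rows above and below so that a
    neighborhood operation can be computed without touching other strips.
    (Hypothetical layout; the paper's actual strip format is not shown here.)
    """
    n = len(image)
    strips = []
    for key, start in enumerate(range(0, n, strip_height)):
        end = min(start + strip_height, n)
        lo = max(start - halo, 0)
        hi = min(end + halo, n)
        strips.append((key, {
            "rows": image[lo:hi],      # core rows plus halo rows
            "offset": start - lo,      # index of the first core row
            "height": end - start,     # number of core rows
        }))
    return strips


def box_filter(rows, radius):
    """Mean filter over a (2*radius+1)-wide window, truncated at the borders."""
    h, w = len(rows), len(rows[0])
    out = []
    for r in range(h):
        out_row = []
        for c in range(w):
            vals = [rows[rr][cc]
                    for rr in range(max(r - radius, 0), min(r + radius, h - 1) + 1)
                    for cc in range(max(c - radius, 0), min(c + radius, w - 1) + 1)]
            out_row.append(sum(vals) / len(vals))
        out.append(out_row)
    return out


def process_strip(strip, radius):
    """Filter one strip and keep only its core rows (halo rows are dropped)."""
    filtered = box_filter(strip["rows"], radius)
    return filtered[strip["offset"]: strip["offset"] + strip["height"]]


def run(image, strip_height, radius):
    """Process every strip independently, then reassemble in key order."""
    strips = make_strips(image, strip_height, halo=radius)
    # This dict comprehension stands in for a distributed map over keyed strips.
    results = {key: process_strip(strip, radius) for key, strip in strips}
    return [row for key in sorted(results) for row in results[key]]
```

With `radius=4` the window is 9 × 9, matching the kernel size mentioned in the abstract. Because each strip carries `radius` halo rows, the strip-wise result equals the result of filtering the whole image at once.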


Pages: 3-19

Page count: 17
