In-Memory Parallel Processing of Massive Remotely Sensed Data Using an Apache Spark on Hadoop YARN Model

Cited by: 48
Authors
Huang, Wei [1]
Meng, Lingkui [1]
Zhang, Dongying [1]
Zhang, Wen [1]
Affiliations
[1] Wuhan Univ, Sch Remote Sensing & Informat Engn, Wuhan 430079, Peoples R China
Keywords
Apache Spark; big data; Hadoop yet another resource negotiator (YARN); parallel processing; remote sensing (RS); FRACTIONAL VEGETATION COVER; SENSING DATA; PERFORMANCE; MODIS; CHALLENGES; MAPREDUCE; FRAMEWORK; SYSTEM; ALBEDO; INDEX
DOI
10.1109/JSTARS.2016.2547020
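Chinese Library Classification (CLC) codes
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology]
Discipline classification codes
0808; 0809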
Abstract
MapReduce has been widely used in Hadoop for parallel processing of large-scale data over the last decade. However, remote-sensing (RS) algorithms based on this programming model are bogged down by intensive disk I/O and unconstrained network communication, and are thus unsuitable for timely processing and analysis of massive, heterogeneous RS data. In this paper, a novel in-memory computing framework called Apache Spark (Spark) is introduced. Because Spark applies transformations to in-memory datasets, these shortcomings are eliminated. To facilitate implementation and ensure high performance of Spark-based algorithms in a complex cloud computing environment, a strip-oriented parallel programming model is proposed. By combining strips of RS data with Spark's resilient distributed datasets (RDDs), parallel RS algorithms at all levels can be easily expressed with coarse-grained transformation primitives and BitTorrent-enabled broadcast variables. Additionally, a generic image partition method is implemented that lets Spark-based RS algorithms efficiently generate distinguishable key/value strips from the Hadoop distributed file system (HDFS), concealing the heterogeneity of RS data. Data-intensive multitasking algorithms and iteration-intensive algorithms were evaluated on a Hadoop yet another resource negotiator (YARN) platform. Experiments indicated that our Spark-based parallel algorithms are highly efficient: a multitasking algorithm took less than 4 h to process more than half a terabyte of RS data on a small YARN cluster, and 9 × 9 convolution operations on a 909-MB image took less than 260 s. Further, the efficiency of iteration-intensive algorithms is insensitive to image size.
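The strip-oriented idea in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the function names (`make_strips`, `box_filter_rows`, `process_strip`, `run_strips`), the halo-row overlap, and the use of a simple mean filter in place of a general convolution are all assumptions made for illustration. Python's built-in `map` and `sorted` stand in for the kind of coarse-grained transformations an RDD would provide.

```python
from typing import List, Tuple

Image = List[List[float]]
# Hypothetical strip layout: (key, (first_loaded_row, own_start, own_stop, rows))
Strip = Tuple[int, Tuple[int, int, int, Image]]


def make_strips(image: Image, strip_rows: int, halo: int) -> List[Strip]:
    """Cut an image into keyed horizontal strips, each padded with halo rows."""
    height = len(image)
    strips = []
    for key, start in enumerate(range(0, height, strip_rows)):
        stop = min(height, start + strip_rows)   # rows this strip owns
        lo = max(0, start - halo)                # extra rows above
        hi = min(height, stop + halo)            # extra rows below
        strips.append((key, (lo, start, stop, image[lo:hi])))
    return strips


def box_filter_rows(rows: Image, first: int, last: int, k: int) -> Image:
    """k x k mean filter over rows[first:last]; the window is clipped at edges."""
    r = k // 2
    h, w = len(rows), len(rows[0])
    out = []
    for y in range(first, last):
        line = []
        for x in range(w):
            window = [rows[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            line.append(sum(window) / len(window))
        out.append(line)
    return out


def process_strip(strip: Strip, k: int) -> Tuple[int, Image]:
    """Filter only the rows a strip owns, using its halo rows as context."""
    key, (lo, start, stop, rows) = strip
    return key, box_filter_rows(rows, start - lo, stop - lo, k)


def run_strips(image: Image, strip_rows: int, k: int) -> Image:
    """Partition, process each strip independently, then merge by key."""
    strips = make_strips(image, strip_rows, halo=k // 2)
    results = map(lambda s: process_strip(s, k), strips)    # stand-in for a map transformation
    merged: Image = []
    for _, rows in sorted(results, key=lambda kv: kv[0]):   # stand-in for sort-by-key and collect
        merged.extend(rows)
    return merged
```

Because each strip carries `k // 2` halo rows, a strip can be filtered without reading its neighbours, and the merged output matches filtering the whole image at once. In a real Spark job the strips would presumably be read from HDFS rather than sliced from an in-memory list.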
Pages: 3-19
Number of pages: 17