MapReduce Across Distributed Clusters for Data-intensive Applications

Cited by: 21

Authors:

Wang, Lizhe [1]

Tao, Jie [2]

Marten, Holger [2]

Streit, Achim [2]

Khan, Samee U. [3]

Kolodziej, Joanna [4]

Chen, Dan [5]

Affiliations:

[1] Chinese Acad Sci, Ctr Earth Observat & Digital Earth, Beijing 100864, Peoples R China

[2] Steinbuch Ctr Comp, Karlsruhe Inst Technol, Karlsruhe, Germany

[3] North Dakota State Univ, Dept Elect & Comp Engn, Fargo, ND 58105 USA

[4] Univ Bielsko Biala, Dept Math & Comp Sci, Bielsko Biala, Poland

[5] China Univ Geosci, Sch Comp, Beijing, Peoples R China

Source:

2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS & PHD FORUM (IPDPSW) | 2012

Keywords:

MapReduce; Hadoop; Data Intensive Computing; Cloud

DOI:

10.1109/IPDPSW.2012.249

Chinese Library Classification number:

TP301 [Theory, Methods]

Subject classification code:

081202

Abstract:

Recently, the computational requirements for large-scale data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed at more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly successful programming model for large-scale data-intensive computing applications. However, current MapReduce implementations are developed to operate in single-cluster environments and cannot be leveraged for large-scale distributed data processing across multiple clusters. On the other hand, workflow systems are used for distributed data processing across data centers. It has been reported that the workflow paradigm has some limitations for distributed data processing, such as reliability and efficiency. In this paper, we present the design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters. G-Hadoop uses the Gfarm file system as its underlying file system and executes MapReduce tasks across distributed clusters. Experiments with the G-Hadoop framework on distributed clusters show encouraging results.

Pages: 2004-2011

Page count: 8

50 items in total

[1] G-Hadoop: MapReduce across distributed data centers for data-intensive computing
Wang, Lizhe
Tao, Jie
Ranjan, Rajiv
Marten, Holger
Streit, Achim
Chen, Jingying
Chen, Dan
[J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2013, 29 (03): 739-750
[2] Accelerating Biomedical Data-Intensive Applications using MapReduce
Han, Liangxiu
Ong, Hwee Yong
[J]. 2012 ACM/IEEE 13TH INTERNATIONAL CONFERENCE ON GRID COMPUTING (GRID), 2012: 49-57
[3] Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
Ahmad, Maaz Bin Safeer
Cheung, Alvin
[J]. SIGMOD'18: PROCEEDINGS OF THE 2018 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2018: 1205-1220
[4] An Efficiency-Aware Scheduling for Data-Intensive Computations on MapReduce Clusters
Zhao, Hui
Yang, Shuqiang
Fan, Hua
Chen, Zhikun
Xu, Jinghu
[J]. IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS, 2013, E96D (12): 2654-2662
[5] Citus: Distributed PostgreSQL for Data-Intensive Applications
Cubukcu, Umur
Erdogan, Ozgun
Pathak, Sumedh
Sannakkayala, Sudhakar
Slot, Marco
[J]. SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021: 2490-2502
[6] Understanding performance of distributed data-intensive applications
Miceli, Christopher
Miceli, Michael
Rodriguez-Milla, Bety
Jha, Shantenu
[J]. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2010, 368 (1926): 4089-4102
[7] Data-Intensive Text Processing with MapReduce
Xu, Peng
[J]. COMPUTATIONAL LINGUISTICS, 2011, 37 (03): 635-637
[8] Design of Self-Adjusting algorithm for data-intensive MapReduce Applications
Nagiwale, Amin Nazir
Umale, Manish R.
Sinha, Aditya Kumar
[J]. 2015 INTERNATIONAL CONFERENCE ON ENERGY SYSTEMS AND APPLICATIONS, 2015: 506-510
[9] CoLoc: Distributed Data and Container Colocation for Data-Intensive Applications
Renner, Thomas
Thamsen, Lauritz
Kao, Odej
[J]. 2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016: 3008-3015
[10] Bucket MapReduce: Relieving the Disk I/O Intensity of Data-Intensive Applications in MapReduce Frameworks
Chen, Kai-Hsun
Chen, Hsin-Yuan
Wang, Chien-Min
[J]. 2021 29TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING (PDP 2021), 2021: 18-25
