MapReduce Across Distributed Clusters for Data-intensive Applications

Cited by: 21
Authors
Wang, Lizhe [1]
Tao, Jie [2]
Marten, Holger [2]
Streit, Achim [2]
Khan, Samee U. [3]
Kolodziej, Joanna [4]
Chen, Dan [5]
Affiliations
[1] Chinese Acad Sci, Ctr Earth Observat & Digital Earth, Beijing 100864, Peoples R China
[2] Steinbuch Ctr Comp, Karlsruhe Inst Technol, Karlsruhe, Germany
[3] North Dakota State Univ, Dept Elect & Comp Engn, Fargo, ND 58105 USA
[4] Univ Bielsko Biala, Dept Math & Comp Sci, Bielsko Biala, Poland
[5] China Univ Geosci, Sch Comp, Beijing, Peoples R China
Keywords
MapReduce; Hadoop; Data Intensive Computing; Cloud
DOI
10.1109/IPDPSW.2012.249
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Recently, the computational requirements for large-scale data-intensive analysis of scientific data have grown significantly. In High Energy Physics (HEP), for example, the Large Hadron Collider (LHC) produced 13 petabytes of data in 2010. This huge amount of data is processed at more than 140 computing centers distributed across 34 countries. The MapReduce paradigm has emerged as a highly successful programming model for large-scale data-intensive computing applications. However, current MapReduce implementations are developed to operate in single-cluster environments and cannot be leveraged for large-scale distributed data processing across multiple clusters. On the other hand, workflow systems are used for distributed data processing across data centers. It has been reported that the workflow paradigm has some limitations for distributed data processing, such as reliability and efficiency. In this paper, we present the design and implementation of G-Hadoop, a MapReduce framework that aims to enable large-scale distributed computing across multiple clusters. G-Hadoop uses the Gfarm file system as its underlying file system and executes MapReduce tasks across distributed clusters. Experiments with the G-Hadoop framework on distributed clusters show encouraging results.
Pages: 2004-2011
Page count: 8