Multi-file Queries Performance Improvement through Data Placement in Hadoop

Authors
Tang, Yu [1]
Abdulhay, Elham [1]
Fan, Aihua
Su, Sheng [1]
Gebreselassie, Kidus [1]
Affiliations
[1] Univ Elect Sci & Technol China, Chengdu 611731, Peoples R China
Keywords
HDFS; Block Placement; Data locality; Correlation;
DOI
Not available
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812
Abstract
Hadoop is popular for processing data-intensive jobs because of its data locality feature. However, the performance gained from this feature is currently limited by Hadoop's default block placement policy, which implicitly assumes that instances of MapReduce jobs access data from a single file. In contrast, multi-file queries such as indexing or aggregation queries need to process related data from more than one file, stored on different DataNodes of a cluster. In this paper we propose a Correlation-based Block Placement (CBP) algorithm that enhances the performance of these queries by placing related blocks on the same set of DataNodes. Furthermore, we developed a customized InputFormat that enables InputSplits to contain records from different files. Simulation results demonstrate that the number of migrating data blocks for CBP was insignificant, whereas under the default policy the number of migrating data blocks increased significantly with the input dataset size. As a result, for any input dataset size, the performance of CBP exceeded that of the default policy.
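The idea of co-locating related blocks can be illustrated with a small simulation. The sketch below is not the authors' CBP algorithm and does not use any Hadoop API; the function names (`co_locating_placement`, `remote_reads`), the index-aligned notion of "related" blocks (block i of one file is related to block i of the other), and the round-robin stand-in for independent placement are all assumptions made for illustration. It counts how many blocks a task would have to fetch from another node when it processes related blocks together.

```python
def co_locating_placement(file_blocks, groups, nodes, replication=2):
    """Assign each (file, block_index) to a list of node names.

    file_blocks: dict mapping file name -> number of blocks (hypothetical input)
    groups: list of sets of file names assumed to be queried together
    Files in the same group get identical node sets for equal block indices;
    all other files are spread independently with a per-file offset.
    """
    placement = {}
    grouped = set()
    for g_idx, group in enumerate(groups):
        for f in group:
            grouped.add(f)
            for b in range(file_blocks[f]):
                # Same start position for block b of every file in the group.
                start = (g_idx + b) * replication
                placement[(f, b)] = [
                    nodes[(start + r) % len(nodes)] for r in range(replication)
                ]
    # Stand-in for a correlation-unaware policy (NOT the real HDFS default).
    for f_idx, f in enumerate(sorted(set(file_blocks) - grouped)):
        for b in range(file_blocks[f]):
            start = f_idx + b
            placement[(f, b)] = [
                nodes[(start + r) % len(nodes)] for r in range(replication)
            ]
    return placement


def remote_reads(placement, file_blocks, group):
    """Count blocks that must be fetched from another node when one task per
    block index processes block b of every file in the group."""
    first = sorted(group)[0]
    n_blocks = min(file_blocks[f] for f in group)
    remote = 0
    for b in range(n_blocks):
        # Run the task on the first replica holder of the first file's block.
        node = placement[(first, b)][0]
        for f in group:
            if node not in placement[(f, b)]:
                remote += 1
    return remote


# Hypothetical workload: two 4-block files joined by a multi-file query.
file_blocks = {"orders": 4, "customers": 4}
nodes = ["dn1", "dn2", "dn3", "dn4", "dn5", "dn6"]
query = {"orders", "customers"}

co_located = co_locating_placement(file_blocks, [query], nodes)
independent = co_locating_placement(file_blocks, [], nodes)
print("remote reads, co-located:", remote_reads(co_located, file_blocks, query))
print("remote reads, independent:", remote_reads(independent, file_blocks, query))
```

In this toy setting the co-located placement needs no remote reads for the query, while the independent placement needs one per block index. The real magnitude of the effect depends on cluster size, replication, and how correlation is determined, none of which this sketch models.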
Pages: 986-991
Page count: 6