A Data Locality Optimization Algorithm for Large-scale Data Processing in Hadoop

Cited by: 0
Authors
Zhao, Yanrong [1]
Wang, Weiping [1]
Meng, Dan [1]
Yang, Xiufeng [1]
Zhang, Shubin [2]
Li, Jun [2]
Guan, Gang [2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Beijing, Peoples R China
[2] Data Platform Dept, Shenzhen, Peoples R China
Funding
US National Science Foundation;
Keywords
Hadoop; MapReduce; join query;
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Data-intensive applications are increasingly designed to execute on large computing clusters. Our previous observations of Tencent production systems indicate that the join query is one of the most important queries in large-scale data processing. When a join query runs on Hive, its job is divided into a map phase and a reduce phase and requires transferring large amounts of intermediate results over the network, which is inefficient. In this paper, we propose an algorithm called CHMJ, whose general idea is to exploit data locality to accelerate computation. It comprises four parts: a data distribution strategy, a parallel HashMapJoin algorithm, CoLocation scheduling, and a delay scheduling strategy. CHMJ has been adopted in the Tencent data warehouse and plays an important role in Tencent's daily operations. Our experiments demonstrate the feasibility and efficiency of the solution.
Pages: 655-661
Page count: 7