Cost-Based Join Algorithm Selection in Hadoop

Cited by: 0
Authors
Gu, Jun [1]
Peng, Shu [1]
Wang, X. Sean [1]
Rao, Weixiong [2]
Yang, Min [1]
Cao, Yu [3]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai 200433, Peoples R China
[2] Tongji Univ, Sch Software Engn, Shanghai, Peoples R China
[3] EMC Labs, Beijing, Peoples R China
Keywords
Join algorithm; Cost model; Hadoop; Hive;
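DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;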
Abstract
In recent years, MapReduce has become a popular computing framework for big data analysis. Join is a major query type in data analysis, and various algorithms have been designed to process join queries on top of Hadoop. Because the efficiency of each algorithm depends on the join task at hand, achieving good performance requires users to select an appropriate algorithm and configure it properly, which is difficult for many end users. This paper proposes a cost model to estimate the cost of four popular join algorithms. Based on the cost model, the system can automatically choose the join algorithm with the least cost and then suggest reasonable configuration values for the chosen algorithm. Experimental results with the TPC-H benchmark verify that the proposed method correctly chooses the best join algorithm, and that the chosen algorithm achieves a speedup of around 1.25 times over the default join algorithm.
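As a rough illustration of the selection step the abstract describes (estimate a cost for each candidate join algorithm, then pick the cheapest), here is a minimal Python sketch. It is not the paper's cost model: the algorithm names `shuffle_join` and `broadcast_join`, the statistics `left_mb`, `right_mb`, and `workers`, and the cost formulas are all hypothetical placeholders chosen only to make the example runnable.

```python
from typing import Callable, Dict, Tuple

# Hypothetical workload statistics; the keys are illustrative assumptions.
Stats = Dict[str, float]


def choose_join(
    stats: Stats,
    cost_models: Dict[str, Callable[[Stats], float]],
) -> Tuple[str, float]:
    """Return (name, estimated cost) of the cheapest candidate algorithm."""
    if not cost_models:
        raise ValueError("no candidate join algorithms")
    # Evaluate every candidate's cost estimate on the same statistics.
    costs = {name: model(stats) for name, model in cost_models.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Placeholder cost functions, NOT the formulas from the paper.
toy_models = {
    # Assumption: both inputs are moved across the network once.
    "shuffle_join": lambda s: s["left_mb"] + s["right_mb"],
    # Assumption: the smaller (right) input is copied to every worker.
    "broadcast_join": lambda s: s["right_mb"] * s["workers"],
}

if __name__ == "__main__":
    stats = {"left_mb": 1000, "right_mb": 10, "workers": 8}
    print(choose_join(stats, toy_models))
```

With these toy numbers the small right-hand input makes the broadcast-style candidate cheaper; a real system would substitute calibrated cost formulas and also derive configuration values for the winner.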
Pages: 246-261
Number of pages: 16
Related papers
50 in total
  • [21] Cost-based scheduling algorithm for workflow-based application in optical grid
    Zhang, Lingzhi
    Guo, Wei
    Jin, Yaohui
    Sun, Weiqiang
    Hu, Weisheng
    [J]. 2011 ASIA COMMUNICATIONS AND PHOTONICS CONFERENCE AND EXHIBITION (ACP), 2012,
  • [22] A prediction-based and cost-based replica replacement algorithm research and simulation
    Ma, T
    Luo, JZ
    [J]. 19TH INTERNATIONAL CONFERENCE ON ADVANCED INFORMATION NETWORKING AND APPLICATIONS, VOL 1, PROCEEDINGS: AINA 2005, 2005, : 935 - 940
  • [23] Cost-based feature selection for GIS-embedded data fusion
    Smits, PC
    Annoni, A
    [J]. IGARSS 2000: IEEE 2000 INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, VOL I - VI, PROCEEDINGS, 2000, : 2614 - 2616
  • [24] An Efficient Improved Join Algorithm Using Map Reduce in Hadoop
    Patel, Warish D.
    Vaghela, Dineshkumar B.
    [J]. 2014 INTERNATIONAL CONFERENCE ON SIGNAL PROPAGATION AND COMPUTER TECHNOLOGY (ICSPCT 2014), 2014, : 263 - 272
  • [25] A modified tabu search algorithm for cost-based job shop problem
    Zhu, Z. C.
    Ng, K. M.
    Ong, H. L.
    [J]. JOURNAL OF THE OPERATIONAL RESEARCH SOCIETY, 2010, 61 (04) : 611 - 619
  • [26] An algorithm for finding MAPs for belief networks through cost-based abduction
    Abdelbar, AM
    [J]. ARTIFICIAL INTELLIGENCE, 1998, 104 (1-2) : 331 - 338
  • [27] A cost-based scheduling algorithm for differentiated service on WDM optical networks
    Ma, M
    [J]. IEEE COMMUNICATIONS LETTERS, 2003, 7 (09) : 460 - 462
  • [28] Elements of cost-based tolerancing
    Youngworth, RN
    Stone, BD
    [J]. OPTICAL REVIEW, 2001, 8 (04) : 276 - 280
  • [29] A cost-based online scheduling algorithm for job assignment on computational grids
    Weng, CL
    Lu, XD
    [J]. ADVANCED PARALLEL PROCESSING TECHNOLOGIES, PROCEEDINGS, 2003, 2834 : 343 - 351
  • [30] A Cost-Based Distributed Algorithm for Load Balancing in Content Delivery Network
    Shuai, Qianjun
    Wang, Keqin
    Miao, Fang
    Jin, Libiao
    [J]. 2017 NINTH INTERNATIONAL CONFERENCE ON INTELLIGENT HUMAN-MACHINE SYSTEMS AND CYBERNETICS (IHMSC 2017), VOL 1, 2017, : 11 - 15