AutoMJ: Towards Efficient Multi-way Join Query on Distributed Data-parallel Platform

被引:1
|
作者
Zhu, Guanghui [1 ]
Wu, Xiaoqi [1 ]
Gu, Rong [1 ]
Yuan, Chunfeng [1 ]
Huang, Yihua [1 ]
机构
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Collaborat Innovat Ctr Novel Software Technol & I, Nanjing 210023, Jiangsu, Peoples R China
基金
中国国家自然科学基金;
关键词
multi-way join; HyperCube shuffle; distributed computing; join size estimation; Apache Spark;
D O I
10.1109/ICPADS.2017.00032
中图分类号
TP3 [计算技术、计算机技术];
学科分类号
0812 ;
摘要
The multi-way join query has attracted considerable attention from research community for its importance in many big data analytic applications. For the multi-round multi-way join algorithm in distributed data-parallel platforms, the huge communication cost caused by shuffling large intermediate results over the network is the main bottleneck. The one-round multi-way join algorithm processes the join query in a single communication round, which can significantly reduce the communication cost in complex queries, including cyclic queries. However, the one-round method is not always superior to the multi-round method, because the intermediate result size of the multi-round method may the much smaller than the size of data shuffled in the one-round method. Therefore, it is challenging to choose the best multi-way join algorithm in practice. To solve this problem, in this paper, we present AutoMJ, an efficient framework for multi-way join queries. In AutoMJ, we propose a novel automatic join strategy selection model based on the size estimation of intermediate join results. AutoMJ chooses the multi-way join strategy with the minimal shuffle data size. In addition, we propose an optimized HyperCube algorithm for the one-round multi-way join. We have implemented the prototype of AutoMJ on the widely-used distributed data-parallel platform Apache Spark. Experiments show that for multi-way join queries with large intermediate results, the one-round join strategy can outperform the multi-round join strategy built in Spark SQL 1.2-159.3x faster. In contrast, the multi-round join strategy is 2.1-6.2 x faster than the one-round method for the queries with small intermediate results. Experiments also show that the relative error of size estimation can be within 0.1 for the Twitter dataset and 0.25 for the Wikidata dataset. Furthermore, experiments verify that the automatic join strategy selection model is effective for choosing the optimal multi-way join algorithm.
引用
收藏
页码:161 / 169
页数:9
相关论文
共 41 条
  • [21] Complexity of estimating multi-way join result sizes for area skewed spatial data
    Park, HH
    Chung, CW
    [J]. INFORMATION PROCESSING LETTERS, 2000, 76 (03) : 121 - 129
  • [22] Efficient distance join query processing in distributed spatial data management systems
    Garcia-Garcia, Francisco
    Corral, Antonio
    Iribarne, Luis
    Vassilakopoulos, Michael
    Manolopoulos, Yannis
    [J]. INFORMATION SCIENCES, 2020, 512 : 985 - 1008
  • [23] SparkDQ: Efficient generic big data quality management on distributed data-parallel computation
    Gu, Rong
    Qi, Yang
    Wu, Tongyu
    Wang, Zhaokang
    Xu, Xiaolong
    Yuan, Chunfeng
    Huang, Yihua
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 156 (156) : 132 - 147
  • [24] Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer
    Bhattu, S. Nagesh
    Potluri, Avinash
    Kadari, Prashanth
    Subramanyam, R. B. V.
    [J]. GEOINFORMATICA, 2020, 24 (03) : 557 - 589
  • [25] Generalized communication cost efficient multi-way spatial join: revisiting the curse of the last reducer
    S. Nagesh Bhattu
    Avinash Potluri
    Prashanth Kadari
    Subramanyam R. B. V.
    [J]. GeoInformatica, 2020, 24 : 557 - 589
  • [26] DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms
    Zhu, Guanghui
    Guo, Chen
    Lu, Le
    Huang, Zhi
    Yuan, Chunfeng
    Gu, Rong
    Huang, Yihua
    [J]. PARALLEL COMPUTING, 2019, 87 : 87 - 102
  • [27] ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
    Bo Zhao
    Hucheng Zhou
    Guoqiang Li
    Yihua Huang
    [J]. Big Data Mining and Analytics, 2018, (01) : 57 - 74
  • [28] ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform
    Zhao, Bo
    Zhou, Hucheng
    Li, Guoqiang
    Huang, Yihua
    [J]. BIG DATA MINING AND ANALYTICS, 2018, 1 (01): : 57 - 74
  • [29] A Multi-way Semi-stream Join for a Near-Real-Time Data Warehouse
    Naeem, M. Asif
    Nguyen, Kim Tung
    Weber, Gerald
    [J]. DATABASES THEORY AND APPLICATIONS, ADC 2017, 2017, 10538 : 59 - 70
  • [30] Optimizing Multi-way Theta Join for Data Skew in Sub-second Stream Computing
    Fan, Xiaopeng
    Liu, Xinchun
    Wang, Yang
    Wang, Youjun
    Li, Jing
    [J]. 2020 IEEE 26TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS), 2020, : 476 - 485