AutoMJ: Towards Efficient Multi-way Join Query on Distributed Data-parallel Platform

Cited by: 1

Authors:

Zhu, Guanghui [1]

Wu, Xiaoqi [1]

Gu, Rong [1]

Yuan, Chunfeng [1]

Huang, Yihua [1]

Affiliations:

[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Collaborat Innovat Ctr Novel Software Technol & I, Nanjing 210023, Jiangsu, Peoples R China

Source:

2017 IEEE 23RD INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS) | 2017

Funding:

National Natural Science Foundation of China;

Keywords:

multi-way join; HyperCube shuffle; distributed computing; join size estimation; Apache Spark;

DOI:

10.1109/ICPADS.2017.00032

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

The multi-way join query has attracted considerable attention from the research community because of its importance in many big data analytics applications. For multi-round multi-way join algorithms on distributed data-parallel platforms, the main bottleneck is the huge communication cost of shuffling large intermediate results over the network. The one-round multi-way join algorithm processes the join query in a single communication round, which can significantly reduce the communication cost of complex queries, including cyclic queries. However, the one-round method is not always superior to the multi-round method, because the intermediate result size of the multi-round method may be much smaller than the size of the data shuffled in the one-round method. Therefore, it is challenging to choose the best multi-way join algorithm in practice. To solve this problem, we present AutoMJ, an efficient framework for multi-way join queries. In AutoMJ, we propose a novel automatic join strategy selection model based on size estimation of intermediate join results. AutoMJ chooses the multi-way join strategy with the minimal shuffle data size. In addition, we propose an optimized HyperCube algorithm for the one-round multi-way join. We have implemented a prototype of AutoMJ on the widely used distributed data-parallel platform Apache Spark. Experiments show that for multi-way join queries with large intermediate results, the one-round join strategy is 1.2-159.3x faster than the multi-round join strategy built into Spark SQL. In contrast, the multi-round join strategy is 2.1-6.2x faster than the one-round method for queries with small intermediate results. Experiments also show that the relative error of size estimation can be within 0.1 for the Twitter dataset and 0.25 for the Wikidata dataset. Furthermore, experiments verify that the automatic join strategy selection model is effective for choosing the optimal multi-way join algorithm.
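
The selection rule stated in the abstract (pick the strategy with the smaller shuffle data size) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy triangle query, and the simplified cost model are assumptions made for illustration. The sketch assumes that a HyperCube shuffle replicates each tuple of a relation once per combination of shares of the join attributes the relation does not contain, and that a multi-round plan shuffles every input relation once plus every estimated intermediate result that feeds a later round.

```python
from math import prod


def hypercube_shuffle_size(relations, shares):
    """Estimate tuples shuffled by a one-round HyperCube join.

    relations: dict mapping a relation name to (tuple_count, set_of_attributes).
    shares: dict mapping each join attribute to its share (number of buckets).
    A tuple is replicated across the shares of every attribute its relation lacks.
    """
    total = 0
    for tuple_count, attrs in relations.values():
        replication = prod(s for attr, s in shares.items() if attr not in attrs)
        total += tuple_count * replication
    return total


def multi_round_shuffle_size(input_sizes, estimated_intermediate_sizes):
    """Simplified cost (an assumption): inputs plus intermediates fed to later rounds."""
    return sum(input_sizes) + sum(estimated_intermediate_sizes)


def choose_strategy(one_round_cost, multi_round_cost):
    """Return the strategy with the smaller estimated shuffle size."""
    return "one-round" if one_round_cost <= multi_round_cost else "multi-round"


# Hypothetical triangle query R(a, b), S(b, c), T(a, c) on 2 x 2 x 2 = 8 workers.
relations = {
    "R": (1000, {"a", "b"}),
    "S": (1000, {"b", "c"}),
    "T": (1000, {"a", "c"}),
}
shares = {"a": 2, "b": 2, "c": 2}

one_round = hypercube_shuffle_size(relations, shares)

# Large estimated intermediate result (R join S): the one-round plan shuffles less.
large = multi_round_shuffle_size([1000, 1000, 1000], [50000])
print(choose_strategy(one_round, large))

# Small estimated intermediate result: the multi-round plan shuffles less.
small = multi_round_shuffle_size([1000, 1000, 1000], [500])
print(choose_strategy(one_round, small))
```

In this toy setting the decision flips purely on the estimated intermediate size, which is why the accuracy of the size estimation reported in the abstract matters for the selection model.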


Pages: 161-169

Page count: 9

