Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms

Cited by: 3

Authors:

Zhu, Guanghui [1]

Wang, Qian [1]

Tang, Qiwei [1]

Gu, Rong [1]

Yuan, Chunfeng [1]

Huang, Yihua [1]

Affiliations:

[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210008, Jiangsu, Peoples R China

Source:

IEEE Transactions on Parallel and Distributed Systems | 2019 / Vol. 30 / No. 12

Funding:

National Natural Science Foundation of China; China Postdoctoral Science Foundation

Keywords:

Distributed databases; Scalability; Remuneration; Lattices; Distributed algorithms; Switches; Query processing; Functional dependency discovery; distributed computing; data-parallel algorithms; ALGORITHM;

DOI:

10.1109/TPDS.2019.2925014

Chinese Library Classification (CLC) number:

TP301 [Theory and methods]

Subject classification code:

081202

Abstract:

Functional dependencies (FDs) play an important role in many data management tasks, such as schema normalization, data cleaning, and query optimization. Meanwhile, application demand for efficient FD discovery on large-scale datasets keeps growing. Unfortunately, because of their large runtime and memory overhead, existing single-machine FD discovery algorithms are inefficient on large-scale datasets. Distributed data-parallel computing has recently become the de facto standard for large-scale data processing, but designing an efficient distributed FD discovery algorithm remains challenging. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling and propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive switching strategy between distributed sampling and distributed validation based on a unified time-based efficiency metric, and we employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark show that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD with speedups of 3.2× to 44.9× and 2.5× to 455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability. Additionally, SmartFD has sub-linear node scalability.
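
Illustrative note: the abstract refers to validating candidate FDs. As a minimal single-machine sketch of what validation means (this is not SmartFD's or AFDD's distributed index-based method, only the textbook definition), an FD X → A holds when any two rows that agree on every attribute in X also agree on A. The names `find_violation`, `fd_holds`, and the toy table `ROWS` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Optional, Sequence, Tuple

Row = Dict[str, Hashable]


def find_violation(
    rows: Sequence[Row], lhs: Sequence[str], rhs: str
) -> Optional[Tuple[int, int]]:
    """Return indices of two rows that violate lhs -> rhs, or None if it holds."""
    # Maps each left-hand-side value combination to the first row index
    # where it appeared and the right-hand-side value seen there.
    first_seen: Dict[Tuple[Hashable, ...], Tuple[int, Hashable]] = {}
    for i, row in enumerate(rows):
        key = tuple(row[attr] for attr in lhs)
        if key not in first_seen:
            first_seen[key] = (i, row[rhs])
        elif first_seen[key][1] != row[rhs]:
            # Same lhs values, different rhs value: the FD is violated.
            return (first_seen[key][0], i)
    return None


def fd_holds(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """True when the functional dependency lhs -> rhs holds on rows."""
    return find_violation(rows, lhs, rhs) is None


# Hypothetical toy table, not taken from the paper's datasets.
ROWS = [
    {"zip": "210008", "city": "Nanjing", "name": "Li"},
    {"zip": "210008", "city": "Nanjing", "name": "Wang"},
    {"zip": "100084", "city": "Beijing", "name": "Li"},
]

if __name__ == "__main__":
    print(fd_holds(ROWS, ["zip"], "city"))         # zip -> city
    print(find_violation(ROWS, ["name"], "city"))  # name -> city
```

On this toy table, `zip → city` holds, while `name → city` is violated by the first and third rows. A violating row pair of this kind is what sampling-based approaches use to rule out candidate FDs early; how SmartFD distributes that work is described in the paper itself, not here.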


Pages: 2663-2676

Page count: 14
