Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms

Cited by: 3
Authors
Zhu, Guanghui [1]
Wang, Qian [1]
Tang, Qiwei [1]
Gu, Rong [1]
Yuan, Chunfeng [1]
Huang, Yihua [1]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210008, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Distributed databases; Scalability; Remuneration; Lattices; Distributed algorithms; Switches; Query processing; Functional dependency discovery; distributed computing; data-parallel algorithms; ALGORITHM;
DOI
10.1109/TPDS.2019.2925014
Chinese Library Classification
TP301 [Theory and Methods]
Subject classification code
081202
Abstract
Functional dependencies (FDs) play an important role in many data management tasks, such as schema normalization, data cleaning, and query optimization. Meanwhile, application demand for efficient FD discovery on large-scale datasets keeps growing. Unfortunately, because of their high runtime and memory overhead, existing single-machine FD discovery algorithms are inefficient on large-scale datasets. Recently, distributed data-parallel computing has become the de facto standard for large-scale data processing. However, designing an efficient distributed FD discovery algorithm is challenging. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling, and we propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning and generation of candidate FDs. Furthermore, we propose an adaptive switching strategy between distributed sampling and distributed validation, based on a unified time-based efficiency metric. We also employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark show that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD, with speedups of 3.2× to 44.9× and 2.5× to 455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability. Additionally, SmartFD has sub-linear node scalability.
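For readers unfamiliar with the term, a functional dependency X -> A holds in a table when any two rows that agree on the attributes in X also agree on A. The sketch below is a minimal single-machine check of one candidate FD in plain Python. It is not SmartFD, AFDD, or the paper's index-based validation method; the function name `fd_holds` and the sample rows are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Sequence, Tuple

Row = Dict[str, Hashable]


def fd_holds(rows: Sequence[Row], lhs: Sequence[str], rhs: str) -> bool:
    """Return True if the functional dependency lhs -> rhs holds in rows.

    Illustrative single-machine check only (hypothetical helper, not the
    paper's distributed validation method).
    """
    # Maps each combination of LHS values to the first RHS value seen with it.
    seen: Dict[Tuple[Hashable, ...], Hashable] = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = row[rhs]
        # setdefault stores value if key is new, else returns the stored one.
        if seen.setdefault(key, value) != value:
            return False  # same LHS values, different RHS value: FD violated
    return True


# Hypothetical sample data.
rows = [
    {"zip": "210008", "city": "Nanjing", "name": "A"},
    {"zip": "210008", "city": "Nanjing", "name": "B"},
    {"zip": "100080", "city": "Beijing", "name": "A"},
]

zip_determines_city = fd_holds(rows, ["zip"], "city")
name_determines_city = fd_holds(rows, ["name"], "city")
```

On this sample, `zip -> city` holds, while `name -> city` does not, because the name "A" appears with two different cities. Discovery algorithms have to test many such candidates across exponentially many attribute combinations, which is why the abstract stresses sampling, pruning, and distributed validation.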
Pages: 2663-2676
Page count: 14