Efficient and Scalable Functional Dependency Discovery on Distributed Data-Parallel Platforms

Cited by: 3
Authors
Zhu, Guanghui [1]
Wang, Qian [1]
Tang, Qiwei [1]
Gu, Rong [1]
Yuan, Chunfeng [1]
Huang, Yihua [1]
Affiliations
[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing 210008, Jiangsu, Peoples R China
Funding
National Natural Science Foundation of China; China Postdoctoral Science Foundation
Keywords
Distributed databases; Scalability; Remuneration; Lattices; Distributed algorithms; Switches; Query processing; Functional dependency discovery; distributed computing; data-parallel algorithms; ALGORITHM;
DOI
10.1109/TPDS.2019.2925014
Chinese Library Classification number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
Functional dependencies (FDs) play an important role in many data management tasks such as schema normalization, data cleaning, and query optimization. Meanwhile, there are ever-increasing application demands for efficient FD discovery on large-scale datasets. Unfortunately, due to huge runtime and memory overhead, the existing single-machine FD discovery algorithms are inefficient for large-scale datasets. Recently, distributed data-parallel computing has become the de facto standard for large-scale data processing. However, it is challenging to design an efficient distributed FD discovery algorithm. In this paper, we present SmartFD, an efficient and scalable algorithm for distributed FD discovery. First, we propose a novel attribute sorting-based algorithm framework. Next, to discover all the FDs grouped by a given attribute, we propose an efficient distributed algorithm, Attribute-centric Functional Dependency Discovery (AFDD). In AFDD, we design a Fast Sampling and Early Aggregation (FSEA) mechanism to improve the efficiency of distributed sampling and propose a memory-efficient index-based method for distributed FD validation. Moreover, AFDD employs an attribute-parallel method to accelerate the pruning-and-generation of candidate FDs. Furthermore, we propose an adaptive switching strategy between distributed sampling and distributed validation based on a unified time-based efficiency metric. We also employ a distributed probing-based method to make the switching strategy more accurate. Experimental results on Apache Spark reveal that SmartFD outperforms the state-of-the-art single-machine algorithm HyFD and the existing distributed algorithm HFDD with speedups of 3.2× to 44.9× and 2.5× to 455.7×, respectively. Moreover, SmartFD achieves good row scalability and column scalability. Additionally, SmartFD has sub-linear node scalability.
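For readers unfamiliar with the term, the following is a minimal single-machine sketch of what validating one candidate FD means: the left-hand-side columns determine the right-hand-side column if no two rows agree on the former but differ on the latter. It is only an illustration of the concept, not the SmartFD or AFDD algorithm; the function name `fd_holds` and the sample rows are hypothetical.

```python
from typing import Hashable, Iterable, Sequence


def fd_holds(rows: Iterable[Sequence[Hashable]], lhs: Sequence[int], rhs: int) -> bool:
    """Return True if the columns in `lhs` functionally determine column `rhs`.

    Hypothetical helper for illustration only: a naive hash-based check,
    not the index-based distributed validation described in the abstract.
    """
    seen = {}  # maps each LHS value combination to the first RHS value observed
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = row[rhs]
        # A second, different RHS value for the same LHS key violates the FD.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Made-up sample relation with columns (name, zip, city).
rows = [
    ("alice", "10001", "New York"),
    ("bob", "10001", "New York"),
    ("carol", "94105", "San Francisco"),
    ("alice", "94105", "San Francisco"),
]

zip_determines_city = fd_holds(rows, [1], 2)
name_determines_zip = fd_holds(rows, [0], 1)
```

In this sample, `zip` determines `city`, while `name` does not determine `zip` because "alice" appears with two different zip codes. Discovery algorithms must run checks of this kind over a very large space of candidate column combinations, which is why sampling, pruning, and distribution matter at scale.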
Pages: 2663-2676
Number of pages: 14
Related papers
50 in total
  • [1] DGST: Efficient and scalable suffix tree construction on distributed data-parallel platforms
    Zhu, Guanghui
    Guo, Chen
    Lu, Le
    Huang, Zhi
    Yuan, Chunfeng
    Gu, Rong
    Huang, Yihua
    [J]. PARALLEL COMPUTING, 2019, 87 : 87 - 102
  • [2] Efficient Data-parallel Computations on Distributed Systems
    曾志勇
    [J]. High Technology Letters, 2002, (03) : 92 - 96
  • [3] Energy-Efficient Execution of Data-Parallel Applications on Heterogeneous Mobile Platforms
    Prakash, Alok
    Wang, Siqi
    Irimiea, Alexandru Eugen
    Mitra, Tulika
    [J]. 2015 33RD IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2015, : 208 - 215
  • [4] Optical interconnectivity in a scalable data-parallel system
    Dines, JAB
    Snowdon, JF
    Desmulliez, MPY
    Barsky, DB
    Shafarenko, AV
    Jesshope, CR
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 1997, 41 (01) : 120 - 130
  • [5] Scalable Random Forest with Data-Parallel Computing
    Vazquez-Novoa, Fernando
    Conejero, Javier
    Tatu, Cristian
    Badia, Rosa M.
    [J]. EURO-PAR 2023: PARALLEL PROCESSING, 2023, 14100 : 397 - 410
  • [6] Parallelizing Machine Learning Optimization Algorithms on Distributed Data-Parallel Platforms with Parameter Server
    Gu, Rong
    Fan, Shiqing
    Hu, Qiu
    Yuan, Chunfeng
    Huang, Yihua
    [J]. 2018 IEEE 24TH INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED SYSTEMS (ICPADS 2018), 2018, : 126 - 133
  • [7] SparkDQ: Efficient generic big data quality management on distributed data-parallel computation
    Gu, Rong
    Qi, Yang
    Wu, Tongyu
    Wang, Zhaokang
    Xu, Xiaolong
    Yuan, Chunfeng
    Huang, Yihua
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2021, 156 (156) : 132 - 147
  • [8] Improving Execution Concurrency of Large-Scale Matrix Multiplication on Distributed Data-Parallel Platforms
    Gu, Rong
    Tang, Yun
    Tian, Chen
    Zhou, Hucheng
    Li, Guanru
    Zheng, Xudong
    Huang, Yihua
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (09) : 2539 - 2552
  • [9] Efficient conditional operations for data-parallel architectures
    Kapasi, UJ
    Dally, WJ
    Rixner, S
    Mattson, PR
    Owens, JD
    Khailany, B
    [J]. 33RD ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE: MICRO-33 2000, PROCEEDINGS, 2000, : 159 - 170
  • [10] Efficient Data-Parallel Primitives on Heterogeneous Systems
    Lai, Zhuohang
    Luo, Qiong
    Xie, Xiaolong
    [J]. PROCEEDINGS OF THE 48TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING (ICPP 2019), 2019,