An Efficient and Scalable Algorithm to Mine Functional Dependencies from Distributed Big Data

Authors
Wu, Wanqing [1, 2]
Mao, Wenyu [1, 2]
Affiliations
[1] Hebei Univ, Coll Cyber Secur & Comp, Baoding 071000, Peoples R China
[2] Hebei Univ, Key Lab High Trusted Informat Syst Hebei Prov, Baoding 071000, Peoples R China
Keywords
data mining; functional dependency; distributed computing; big data; DISCOVERY; APPROXIMATE;
DOI
10.3390/s22103856
Chinese Library Classification number
O65 [Analytical Chemistry];
Discipline classification code
070302 ; 081704 ;
Abstract
A crucial step in improving data quality is discovering semantic relationships between data. Functional dependencies are rules that describe such relationships in relational databases and have recently been applied to improve data quality. However, applying traditional functional dependency discovery algorithms to distributed data may produce incorrect results, and these algorithms do not scale to large datasets. To address these problems, we propose a novel distributed functional dependency discovery algorithm based on Apache Spark that can effectively discover functional dependencies in large-scale data. The basic idea is to redistribute the data so that functional dependencies can be discovered in parallel on multiple nodes. The algorithm uses sampling to quickly eliminate invalid functional dependencies and a greedy task assignment strategy to balance the load. In addition, a prefix tree stores intermediate results during validation to avoid recomputing equivalence classes. Experimental results on real and synthetic datasets show that the proposed algorithm is more efficient than existing methods while ensuring accuracy.
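The ideas named in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: the function names `fd_holds`, `prune_with_sample`, and `greedy_assign` are hypothetical, the sample is simply the first rows rather than whatever sampling scheme the paper uses, the greedy rule shown (largest task first, to the least-loaded worker) is one common choice assumed here for illustration, and the prefix tree is omitted.

```python
import heapq
from collections import defaultdict


def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds on rows.

    rows is a list of dicts, lhs a tuple of column names, rhs one column name.
    Rows that agree on lhs form an equivalence class; the dependency holds
    when every class carries exactly one rhs value.
    """
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def prune_with_sample(rows, candidates, sample_size):
    """Drop candidates that are already violated on a small sample.

    Assumption: the sample is the first sample_size rows. A violation found
    in any subset is a violation in the full data, so pruning is safe;
    surviving candidates still need validation on all rows.
    """
    sample = rows[:sample_size]
    return [(lhs, rhs) for lhs, rhs in candidates if fd_holds(sample, lhs, rhs)]


def greedy_assign(task_costs, n_workers):
    """Assign tasks largest-cost-first, each to the currently least-loaded worker."""
    heap = [(0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = defaultdict(list)
    for task, cost in sorted(task_costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + cost, worker))
    return dict(assignment)


rows = [
    {"zip": "071000", "city": "Baoding", "name": "a"},
    {"zip": "071000", "city": "Baoding", "name": "b"},
    {"zip": "100000", "city": "Beijing", "name": "a"},
]
candidates = [(("zip",), "city"), (("name",), "city")]
survivors = prune_with_sample(rows, candidates, sample_size=3)
plan = greedy_assign({"t1": 5, "t2": 3, "t3": 3, "t4": 1}, n_workers=2)
```

In this toy data `zip -> city` survives while `name -> city` is eliminated, because the two rows named `a` disagree on `city`. In a distributed setting, each surviving candidate would become a validation task whose estimated cost feeds the assignment step.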
Pages: 19
Related papers
50 entries in total
  • [1] Scalable Functional Dependencies Discovery from Big Data
    Tu Shouzhong
    Huang Minlie
    [J]. 2016 IEEE SECOND INTERNATIONAL CONFERENCE ON MULTIMEDIA BIG DATA (BIGMM), 2016, : 426 - 431
  • [2] An Efficient Distributed Algorithm for Big Data Processing
    Mohammed S. Al-kahtani
    Lutful Karim
    [J]. Arabian Journal for Science and Engineering, 2017, 42 : 3149 - 3157
  • [3] An Efficient Distributed Algorithm for Big Data Processing
    Al-kahtani, Mohammed S.
    Karim, Lutful
    [J]. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2017, 42 (08) : 3149 - 3157
  • [4] Scalable and Hierarchical Distributed Data Structures for Efficient Big Data Management
    Sioutas, Spyros
    Vonitsanos, Gerasimos
    Zacharatos, Nikolaos
    Zaroliagis, Christos
    [J]. ALGORITHMIC ASPECTS OF CLOUD COMPUTING (ALGOCLOUD 2019), 2020, 12041 : 122 - 160
  • [5] A scalable and distributed dendritic cell algorithm for big data classification
    Dagdia, Zaineb Chelly
    [J]. SWARM AND EVOLUTIONARY COMPUTATION, 2019, 50
  • [6] Scalable Data Exchange with Functional Dependencies
    Marnette, Bruno
    Mecca, Giansalvatore
    Papotti, Paolo
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2010, 3 (01): : 105 - 116
  • [7] FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark
    Feng Cheng
    Zhe Yang
    [J]. The Journal of Supercomputing, 2019, 75 : 2497 - 2517
  • [8] FastMFDs: a fast, efficient algorithm for mining minimal functional dependencies from large-scale distributed data with Spark
    Cheng, Feng
    Yang, Zhe
    [J]. JOURNAL OF SUPERCOMPUTING, 2019, 75 (05): : 2497 - 2517
  • [9] An efficient and scalable privacy preserving algorithm for big data and data streams
    Chamikara, M. A. P.
    Bertok, P.
    Liu, D.
    Camtepe, S.
    Khalil, I
    [J]. COMPUTERS & SECURITY, 2019, 87
  • [10] Functional Dependencies Unleashed for Scalable Data Exchange
    Bonifati, Angela
    Ileana, Ioana
    Linardi, Michele
    [J]. 28TH INTERNATIONAL CONFERENCE ON SCIENTIFIC AND STATISTICAL DATABASE MANAGEMENT (SSDBM 2016), 2016,