Scalable Similarity Joins for Fast and Accurate Record Deduplication in Big Data

Cited by: 0
Authors
Rozinek, Ondrej [1]
Borkovcova, Monika [2]
Mares, Jan [1,3]
Affiliations
[1] Department of Process Control, University of Pardubice, Studentska 95, Pardubice, 532 10, Czech Republic
[2] Department of Information Technology, University of Pardubice, Studentska 95, Pardubice, 532 10, Czech Republic
[3] Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, Prague, 166 28, Czech Republic
Source
Engineering Village
DOI
Not available
Keywords
Bipartite matchings - Data-source - Deduplication - Entity resolutions - Matchings - Q-gram filters - Record deduplication - Record linkage - Similarity join - Similarity spaces
Pages: 181 - 191
Related papers
50 in total
  • [21] Scalable algorithms for signal reconstruction by leveraging similarity joins
    Asudeh, Abolfazl
    Augustine, Jees
    Nazi, Azade
    Thirumuruganathan, Saravanan
    Zhang, Nan
    Das, Gautam
    Srivastava, Divesh
    VLDB JOURNAL, 2020, 29 (2-3): 681 - 707
  • [22] Deduplication on Encrypted Big Data in Cloud
    Yan, Zheng
    Ding, Wenxiu
    Yu, Xixun
    Zhu, Haiqi
    Deng, Robert H.
    IEEE Transactions on Big Data, 2016, 2 (02): 138 - 150
  • [23] Similarity based deduplication with small data chunks
    Aronovich, L.
    Asher, R.
    Harnik, D.
    Hirsch, M.
    Klein, S. T.
    Toaff, Y.
    DISCRETE APPLIED MATHEMATICS, 2016, 212 : 10 - 22
  • [24] Similarity Based Deduplication with Small Data Chunks
    Aronovich, Lior
    Asher, Ron
    Harnik, Danny
    Hirsch, Michael
    Klein, Shmuel T.
    Toaff, Yair
    PROCEEDINGS OF THE PRAGUE STRINGOLOGY CONFERENCE 2012, 2012: 3 - 17
  • [25] Data Distribution for Fast Joins
    Libkin, Leonid
    COMMUNICATIONS OF THE ACM, 2017, 60 (03) : 92 - 92
  • [26] Fast and Accurate Estimates of Divergence Times from Big Data
    Mello, Beatriz
    Tao, Qiqing
    Tamura, Koichiro
    Kumar, Sudhir
    MOLECULAR BIOLOGY AND EVOLUTION, 2017, 34 (01) : 45 - 50
  • [27] MR-SimLab: Scalable subgraph selection with label similarity for big data
    Dhifli, Wajdi
    Aridhi, Sabeur
    Nguifo, Engelbert Mephu
    INFORMATION SYSTEMS, 2017, 69 : 155 - 163
  • [28] Fast, scalable and geo-distributed PCA for big data analytics
    Adnan, T. M. Tariq
    Tanjim, Md Mehrab
    Adnan, Muhammad Abdullah
    INFORMATION SYSTEMS, 2021, 98 (98)
  • [29] Fast and Scalable Big Data Trajectory Clustering for Understanding Urban Mobility
    Kumar, Dheeraj
    Wu, Huayu
    Rajasegarar, Sutharshan
    Leckie, Christopher
    Krishnaswamy, Shonali
    Palaniswami, Marimuthu
    IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2018, 19 (11) : 3709 - 3722
  • [30] BIGMiner: a fast and scalable distributed frequent pattern miner for big data
    Chon, Kang-Wook
    Kim, Min-Soo
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2018, 21 (03): 1507 - 1520