SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering

Cited by: 4
Authors
Ribeiro, Leonardo Andrade [1]
Cuzzocrea, Alfredo [2, 3]
Alves Bezerra, Karen Aline [4]
do Nascimento, Ben Hur Bahia [4]
Affiliations
[1] Univ Fed Goias, Inst Informat, Goiania, Go, Brazil
[2] Univ Trieste, Trieste, Italy
[3] ICAR CNR, Trieste, Italy
[4] Univ Fed Lavras, Dept Ciencia Comp, Lavras, Brazil
Keywords
Data Integration; Data Cleaning; Duplicate Identification; Set Similarity Joins; Clustering; QUERY
DOI
10.5220/0005868700750080
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs a similarity join to find pairs of similar records, followed by a clustering algorithm to group together records that refer to the same entity. However, the clustering algorithm is used strictly as a post-processing step, which slows down the overall process and produces results only at the end. In this paper, we propose SjClust, a framework that integrates similarity join and clustering into a single operation. Our approach smoothly accommodates a variety of cluster representation and merging strategies within set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
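For orientation, the two-step baseline that the abstract contrasts with SjClust (a similarity join, then clustering as a separate post-processing pass) might look roughly like the sketch below. It is not the SjClust algorithm and is not taken from the paper: the Jaccard measure, the naive quadratic join, the connected-components clustering, the function names `jaccard`, `similarity_join`, and `cluster_pairs`, the toy records, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold):
    """Naive O(n^2) self-join: index pairs with Jaccard similarity >= threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if jaccard(a, b) >= threshold
    ]


def cluster_pairs(n, pairs):
    """Post-processing step: connected components over the joined pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical token sets standing in for tokenized records.
records = [
    {"acme", "corp", "ny"},
    {"acme", "corp", "inc", "ny"},
    {"globex", "llc", "sf"},
    {"globex", "sf"},
]

pairs = similarity_join(records, threshold=0.5)
clusters = cluster_pairs(len(records), pairs)
print(pairs)     # [(0, 1), (2, 3)]
print(clusters)  # [[0, 1], [2, 3]]
```

In this sketch no cluster is available until every pair has been joined, which is the limitation the abstract attributes to the post-processing approach.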
Pages: 75-80
Number of pages: 6