An incremental clustering scheme for data de-duplication

Cited by: 0
Authors
Gianni Costa
Giuseppe Manco
Riccardo Ortale
Affiliations
[1] ICAR-CNR,
Source
Keywords
Clustering-mining methods and algorithms; Record classification; Indexing methods and structures; Locality-sensitive hashing; Min-wise independent permutations; Approximated similarity measures; De-duplication;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps each tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be identified efficiently by retrieving only those tuples that appear in the buckets associated with the query tuple, without scanning the whole database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
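The indexing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it is not claimed to match either of the two key-computation schemes the paper compares. Everything below is an assumption made for illustration: tuples are treated as plain strings, similarity is the Jaccard coefficient over character 3-grams, salted BLAKE2b hashes stand in for min-wise independent permutations, and the indexing keys are bands of the resulting signature. The class name `IncrementalDeduplicator`, its parameters, and the threshold value are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, q=3):
    """Character q-grams of a case- and whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) <= q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def minhash_signature(tokens, num_hashes):
    """One minimum per salted hash function (a stand-in for min-wise permutations)."""
    sig = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return sig


class IncrementalDeduplicator:
    """Hypothetical incremental clusterer backed by a banded min-hash index."""

    def __init__(self, num_hashes=24, bands=12, threshold=0.5):
        if num_hashes % bands != 0:
            raise ValueError("num_hashes must be a multiple of bands")
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.threshold = threshold      # assumed similarity cut-off for duplicates
        self.buckets = defaultdict(set)  # indexing key -> ids of stored tuples
        self.shingle_sets = []           # tuple id -> q-gram set
        self.cluster_of = []             # tuple id -> cluster id
        self.next_cluster = 0

    def _keys(self, tokens):
        """Map a tuple to its set of indexing keys, one key per band."""
        sig = minhash_signature(tokens, self.num_hashes)
        return [
            (b, tuple(sig[b * self.rows:(b + 1) * self.rows]))
            for b in range(self.bands)
        ]

    def add(self, text):
        """Assign a newly arrived tuple to a cluster and return the cluster id."""
        tokens = shingles(text)
        keys = self._keys(tokens)
        # Candidate neighbors: only tuples sharing at least one bucket.
        candidates = set()
        for key in keys:
            candidates |= self.buckets.get(key, set())
        # Nearest neighbor among the candidates, verified with exact Jaccard.
        best_id, best_sim = None, 0.0
        for cand in candidates:
            sim = jaccard(tokens, self.shingle_sets[cand])
            if sim > best_sim:
                best_id, best_sim = cand, sim
        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]
        else:
            cluster = self.next_cluster  # no close neighbor: open a new cluster
            self.next_cluster += 1
        new_id = len(self.shingle_sets)
        self.shingle_sets.append(tokens)
        self.cluster_of.append(cluster)
        for key in keys:
            self.buckets[key].add(new_id)
        return cluster


if __name__ == "__main__":
    dedup = IncrementalDeduplicator()
    for record in [
        "John Smith, 12 Main Street, Rome",
        "Maria Rossi, Via Roma 5, Cosenza",
        "  john smith, 12 main street,   ROME",
    ]:
        print(dedup.add(record), record.strip())
```

Each call to `add` touches only the buckets addressed by the new tuple's keys, so the stored tuples are never scanned in full. Whether a near-duplicate with small edits lands in a shared bucket is probabilistic and depends on the assumed `num_hashes` and `bands` settings. Exact duplicates after normalization always share every key.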
Pages: 152 - 187
Page count: 35
Related papers
50 in total
  • [1] An incremental clustering scheme for data de-duplication
    Costa, Gianni
    Manco, Giuseppe
    Ortale, Riccardo
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2010, 20 (01) : 152 - 187
  • [2] Semi-supervised clustering for de-duplication
    Kushagra, Shrinu
    Ben-David, Shai
    Ilyas, Ihab F.
    [J]. 22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89
  • [3] Secure Static Data De-duplication
    Pawar, Rohit
    Zanwar, Payal
    Bora, Shruti
    Kullkarni, Shweta
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2016, 16 (03) : 69 - 73
  • [4] Data De-duplication on Similar File Detection
    Zhu, Yueguang
    Zhang, Xingjun
    Zhao, Runting
    Dong, Xiaoshe
    [J]. 2014 EIGHTH INTERNATIONAL CONFERENCE ON INNOVATIVE MOBILE AND INTERNET SERVICES IN UBIQUITOUS COMPUTING (IMIS), 2014, : 66 - 73
  • [5] Research on Chunking Algorithms of Data De-duplication
    Bo, Cai
    Li, Zhang Feng
    Can, Wang
    [J]. PROCEEDINGS OF THE 2012 INTERNATIONAL CONFERENCE ON COMMUNICATION, ELECTRONICS AND AUTOMATION ENGINEERING, 2013, 181 : 1019 - 1025
  • [6] Sequence of hashes compression in data de-duplication
    Balachandran, Subashini
    Constantinescu, Cornel
    [J]. DCC: 2008 DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2008, : 505 - 505
  • [7] A semi-supervised framework of clustering selection for de-duplication
    Kushagra, Shrinu
    Saxena, Hemant
    Ilyas, Ihab F.
    Ben-David, Shai
    [J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 208 - 219
  • [8] Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication
    Tan, Nigel
    Luettgau, Jakob
    Marquez, Jack
    Teranishi, Keita
    Morales, Nicolas
    Bhowmick, Sanjukta
    Cappello, Franck
    Taufer, Michela
    Nicolae, Bogdan
    [J]. PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 665 - 674
  • [9] Data Structure for Packet De-duplication in Distributed Environments
    Finta, Istvan
    Farkas, Lorant
    Szenasi, Sandor
    [J]. 2020 IEEE SIXTH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (BIGDATASERVICE 2020), 2020, : 184 - 189
  • [10] A Bayesian approach for de-duplication in the presence of relational data
    Sosa, Juan
    Rodriguez, Abel
    [J]. JOURNAL OF APPLIED STATISTICS, 2024, 51 (02) : 197 - 215