An incremental clustering scheme for data de-duplication

Cited by: 0

Authors:

Gianni Costa

Giuseppe Manco

Riccardo Ortale

Affiliation:

[1] ICAR-CNR,

Source:

Data Mining and Knowledge Discovery | 2010 / Vol. 20

Keywords:

Clustering-mining methods and algorithms; Record classification; Indexing methods and structures; Locality-sensitive hashing; Min-wise independent permutations; Approximated similarity measures; De-duplication

DOI:

Not available

Abstract:

We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by retrieving those tuples that appear in the buckets associated with the query tuple itself, without scanning the entire database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
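
The abstract does not spell out the two key-generation schemes, so the following is only a minimal illustrative sketch of the general idea it describes: a hash-based index whose buckets are used to find the neighbors of a newly arrived tuple, which is then assigned to the cluster of its nearest neighbor. Everything concrete here is an assumption and not taken from the paper: character 3-grams as the tuple representation, min-hash signatures split into bands as indexing keys, Jaccard similarity as the syntactic similarity, and the hypothetical names `IncrementalDeduplicator`, `shingles`, `num_hashes`, `bands`, and `threshold`.

```python
import hashlib
from collections import defaultdict


def shingles(text, q=3):
    """Return the set of character q-grams of a whitespace/case-normalized string."""
    s = " ".join(text.lower().split())
    return {s[i:i + q] for i in range(max(1, len(s) - q + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def _h(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


class IncrementalDeduplicator:
    """Hypothetical sketch: min-hash banding index + nearest-neighbor assignment."""

    def __init__(self, num_hashes=32, bands=8, threshold=0.6):
        assert num_hashes % bands == 0
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.threshold = threshold          # assumed similarity cut-off
        self.buckets = defaultdict(list)    # indexing key -> ids of stored tuples
        self.shingle_sets = []              # tuple id -> shingle set
        self.cluster_of = []                # tuple id -> cluster id
        self.num_clusters = 0

    def _keys(self, sh):
        # Min-hash signature: the minimum hash value of the set under each seed.
        sig = [min(_h(seed, t) for t in sh) for seed in range(self.num_hashes)]
        # One indexing key per band; similar sets tend to share at least one key.
        return [(b, tuple(sig[b * self.rows:(b + 1) * self.rows]))
                for b in range(self.bands)]

    def add(self, text):
        """Index a newly arrived tuple and return the id of its cluster."""
        sh = shingles(text)
        keys = self._keys(sh)
        # Candidate neighbors: only tuples sharing a bucket, no full scan.
        candidates = {i for k in keys for i in self.buckets.get(k, ())}
        best, best_sim = None, 0.0
        for i in candidates:
            sim = jaccard(sh, self.shingle_sets[i])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best]     # join nearest neighbor's cluster
        else:
            cluster = self.num_clusters         # no close neighbor: new cluster
            self.num_clusters += 1
        new_id = len(self.shingle_sets)
        self.shingle_sets.append(sh)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].append(new_id)
        return cluster
```

Calling `add` once per incoming tuple returns a cluster id, and tuples that receive the same id are treated as duplicates. In this sketch, more rows per band make bucket collisions rarer (fewer candidates, more missed neighbors), while more bands make them more frequent; the values above are arbitrary defaults rather than tuned settings.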

Pages: 152-187

Page count: 35

50 items in total

[1] An incremental clustering scheme for data de-duplication
Costa, Gianni
Manco, Giuseppe
Ortale, Riccardo
[J]. DATA MINING AND KNOWLEDGE DISCOVERY, 2010, 20 (01) : 152 - 187
[2] Semi-supervised clustering for de-duplication
Kushagra, Shrinu
Ben-David, Shai
Ilyas, Ihab F.
[J]. 22ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND STATISTICS, VOL 89, 2019, 89
[3] Secure Static Data De-duplication
Pawar, Rohit
Zanwar, Payal
Bora, Shruti
Kullkarni, Shweta
[J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2016, 16 (03) : 69 - 73
[4] Data De-duplication on Similar File Detection
Zhu, Yueguang
Zhang, Xingjun
Zhao, Runting
Dong, Xiaoshe
[J]. 2014 EIGHTH INTERNATIONAL CONFERENCE ON INNOVATIVE MOBILE AND INTERNET SERVICES IN UBIQUITOUS COMPUTING (IMIS), 2014 : 66 - 73
[5] Research on Chunking Algorithms of Data De-duplication
Bo, Cai
Li, Zhang Feng
Can, Wang
[J]. PROCEEDINGS OF THE 2012 INTERNATIONAL CONFERENCE ON COMMUNICATION, ELECTRONICS AND AUTOMATION ENGINEERING, 2013, 181 : 1019 - 1025
[6] Sequence of hashes compression in data de-duplication
Balachandran, Subashini
Constantinescu, Cornel
[J]. DCC: 2008 DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2008 : 505 - 505
[7] A semi-supervised framework of clustering selection for de-duplication
Kushagra, Shrinu
Saxena, Hemant
Ilyas, Ihab F.
Ben-David, Shai
[J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019 : 208 - 219
[8] Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication
Tan, Nigel
Luettgau, Jakob
Marquez, Jack
Terianishi, Keita
Morales, Nicolas
Bhowmick, Sanjukta
Cappello, Franck
Taufer, Michela
Nicolae, Bogdan
[J]. PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023 : 665 - 674
[9] Data Structure for Packet De-duplication in Distributed Environments
Finta, Istvan
Farkas, Lorant
Szenasi, Sandor
[J]. 2020 IEEE SIXTH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (BIGDATASERVICE 2020), 2020 : 184 - 189
[10] A Bayesian approach for de-duplication in the presence of relational data
Sosa, Juan
Rodriguez, Abel
[J]. JOURNAL OF APPLIED STATISTICS, 2024, 51 (02) : 197 - 215
