An incremental clustering scheme for data de-duplication

Authors
Gianni Costa
Giuseppe Manco
Riccardo Ortale
Affiliation
[1] ICAR-CNR,
Keywords
Clustering-mining methods and algorithms; Record classification; Indexing methods and structures; Locality-sensitive hashing; Min-wise independent permutations; Approximated similarity measures; De-duplication
DOI
Not available
Abstract
We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by retrieving only those tuples that appear in the buckets associated with the query tuple, without scanning the entire database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
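The indexing idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe the two key-generation schemes, so the sketch assumes character 3-grams as tuple features, min-wise hashing with a banding step to derive indexing keys, Jaccard similarity for the nearest-neighbor check, and a similarity threshold of 0.5. All names (`qgrams`, `minhash_signature`, `index_keys`, `IncrementalDeduplicator`) and parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Character q-grams of a case- and whitespace-normalized string (assumed feature set)."""
    s = " ".join(text.lower().split())
    if len(s) <= q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash; each seed stands in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """Min-wise hashing: keep the minimum hash value of the token set per seed."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_hashes))


def index_keys(signature, rows_per_band=2):
    """Split the signature into bands; each (band offset, band values) pair is one bucket key."""
    return [(start, signature[start:start + rows_per_band])
            for start in range(0, len(signature), rows_per_band)]


def jaccard(a, b):
    return len(a & b) / len(a | b)


class IncrementalDeduplicator:
    """Assigns each arriving tuple to the cluster of its most similar indexed neighbor."""

    def __init__(self, threshold=0.5, num_hashes=32, rows_per_band=2):
        self.threshold = threshold          # assumed minimum similarity for a duplicate
        self.num_hashes = num_hashes
        self.rows_per_band = rows_per_band
        self.buckets = defaultdict(list)    # bucket key -> ids of tuples stored there
        self.tokens = []                    # tuple id -> q-gram set
        self.cluster_of = []                # tuple id -> cluster id
        self.num_clusters = 0

    def add(self, text):
        toks = qgrams(text)
        keys = index_keys(minhash_signature(toks, self.num_hashes), self.rows_per_band)
        # Candidate neighbors: only tuples sharing at least one bucket with the query.
        candidates = {i for k in keys for i in self.buckets.get(k, ())}
        best_id, best_sim = None, 0.0
        for i in candidates:
            sim = jaccard(toks, self.tokens[i])
            if sim > best_sim:
                best_id, best_sim = i, sim
        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]   # join the nearest neighbor's cluster
        else:
            cluster = self.num_clusters          # no close neighbor: open a new cluster
            self.num_clusters += 1
        new_id = len(self.tokens)
        self.tokens.append(toks)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].append(new_id)
        return cluster


dedup = IncrementalDeduplicator()
first = dedup.add("John Smith, 12 Main Street, Springfield")
second = dedup.add("Jon Smith, 12 Main Street, Springfield")
third = dedup.add("Maria Rossi, Via Roma 5, Cosenza")
print(first, second, third)
```

In this sketch the two "Smith" records should almost always share a bucket and therefore a cluster, while the third record opens a new one. Bucket collisions are probabilistic, so a smaller `rows_per_band` trades more candidate comparisons for fewer missed duplicates.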
Pages: 152-187
Page count: 35