Object-based data de-duplication method for OpenXML compound files

Cited: 0
Authors / Institutions
[1] School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100086, China
[2] Not specified, 101149, China
Source
Jisuanji Yanjiu yu Fazhan (Journal of Computer Research and Development), 2015, (7): 1546-1557
Keywords
Object detection
DOI
10.7544/issn1000-1239.2015.20140093
Abstract
Content-defined chunking (CDC) is a prevalent de-duplication algorithm for removing redundant data segments in storage systems. Existing CDC research, however, ignores the distinct content characteristics of different file types: chunk boundaries are determined in a content-agnostic, effectively random way, and a single strategy is applied to every file type. Such methods have been shown to work well for text and other simple content, but they fall short of optimal performance on compound files. A compound file consists of unstructured data, typically occupies large storage space, and often contains multimedia data. Object-based de-duplication is currently the most advanced approach and an effective way to detect duplicate data in such files. We analyze the content characteristics of OpenXML files and develop an object extraction method, and we propose an algorithm that determines the de-duplication granularity from the structure and distribution of the extracted objects. The goal is to detect identical objects within a file and across different files, and to de-duplicate effectively even when the physical layout of a compound file changes. In simulation experiments on a typical unstructured data collection, the proposed method improves de-duplication efficiency by roughly 10% over CDC on unstructured data in general. © 2015, Science Press. All rights reserved.
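The abstract's core idea is that an OpenXML compound file is a container of separable objects that can be fingerprinted and matched regardless of the container's physical layout. A minimal sketch of that idea follows, assuming only that OpenXML documents (.docx, .xlsx, .pptx) are ZIP containers whose parts can be enumerated with Python's standard `zipfile` module. The function names and SHA-256 fingerprinting are illustrative assumptions, not the paper's actual algorithm, which additionally derives the de-duplication granularity from the objects' structure and distribution.

```python
import hashlib
import zipfile

def object_fingerprints(container):
    """Map each part (object) in an OpenXML ZIP container to a SHA-256 digest.

    `container` may be a file path or a seekable file-like object.
    """
    with zipfile.ZipFile(container) as zf:
        return {info.filename: hashlib.sha256(zf.read(info.filename)).hexdigest()
                for info in zf.infolist()}

def shared_objects(container_a, container_b):
    """Return digests of objects present in both containers.

    Matching on content digests (not part names or byte offsets) finds the
    same object even when the file's physical layout has changed, e.g. when
    parts are renamed, reordered, or stored at different offsets.
    """
    a = set(object_fingerprints(container_a).values())
    b = set(object_fingerprints(container_b).values())
    return a & b
```

Because each object is hashed independently of its position in the container, an embedded image that survives an edit-and-save cycle is still detected as a duplicate, which is precisely where fixed- or content-defined chunking over the raw container bytes tends to lose alignment.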
Related papers (50 in total)
  • [21] A Bayesian approach for de-duplication in the presence of relational data. Sosa, Juan; Rodriguez, Abel. Journal of Applied Statistics, 2024, 51(02): 197-215.
  • [22] Data Structure for Packet De-duplication in Distributed Environments. Finta, Istvan; Farkas, Lorant; Szenasi, Sandor. 2020 IEEE Sixth International Conference on Big Data Computing Service and Applications (BigDataService 2020), 2020: 184-189.
  • [23] Energy Aware Data Layout for De-duplication System. Yan Fang; Tan YuAn; Liang QingGang; Xing NingNing; Wang YaoLei; Zhang Xiang. 2012 13th International Conference on Parallel and Distributed Computing, Applications, and Technologies (PDCAT 2012), 2012: 511-516.
  • [24] Semantic Data De-duplication for Archival Storage Systems. Liu, Chuanyi; Ju, Dapeng; Gu, Yu; Zhang, Youhui; Wang, Dongsheng; Du, David H. C. 2008 13th Asia-Pacific Computer Systems Architecture Conference, 2008: 154+.
  • [25] A study on data de-duplication schemes in cloud storage. Kumar, Priyan Malarvizhi; Devi, G. Usha; Basheer, Shakila; Parthasarathy, P. International Journal of Grid and Utility Computing, 2020, 11(04): 509-516.
  • [26] A grouping prediction method based on undirected graph traversal in de-duplication system. Wang, Longxiang; Zhang, Xingjun; Zhu, Guofeng; Zhu, Yueguang; Dong, Xiaoshe. Hsi-An Chiao Tung Ta Hsueh/Journal of Xi'an Jiaotong University, 2013, 47(10): 51-56.
  • [27] A strategy of de-duplication based on the similarity of adjacent chunks. Zhou, B.; Tan, J.-H. 2017, Taru Publications (20): 1577-1580.
  • [28] Introspection-based Memory De-duplication and Migration. Chiang, Jui-Hao; Li, Han-Lin; Chiueh, Tzi-cker. ACM SIGPLAN Notices, 2013, 48(07): 51-61.
  • [29] Data De-duplication with Adaptive Chunking and Accelerated Modification Identifying. Zhang, Xingjun; Zhu, Guofeng; Wang, Endong; Fowler, Scott; Dong, Xiaoshe. Computing and Informatics, 2016, 35(03): 586-614.
  • [30] Data De-duplication Using Cuckoo Hashing in Cloud Storage. Sridharan, J.; Valliyammai, C.; Karthika, R. N.; Kulasekaran, L. Nihil. Soft Computing in Data Analytics, SCDA 2018, 2019, 758: 707-715.