Object-based data de-duplication method for OpenXML compound files

Author affiliations
[1] School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100086, China
[2] Unknown, 101149, China
Source
Jisuanji Yanjiu yu Fazhan, 2015, issue 7, pp. 1546-1557
Keywords
Object detection;
DOI
10.7544/issn1000-1239.2015.20140093
Abstract
Content-defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in storage systems. Current research on CDC does not consider the content characteristics of different file types: chunk boundaries are determined in a random way, and a single strategy is applied to all file types. This approach has been shown to suit text and simple content, but it does not achieve optimal performance for compound files. A compound file is composed of unstructured data, usually occupies a large amount of storage space, and contains multimedia data. Object-based data de-duplication is currently the most advanced method and is an effective solution for detecting duplicate data in such files. We analyze the content characteristics of OpenXML files and develop an object extraction method. As part of this process, we propose an algorithm that determines the de-duplication granularity based on the structure and distribution of the objects. The goal is to detect identical objects within a file or across different files, and to de-duplicate compound files effectively even when their physical layout changes. In simulation experiments on a typical unstructured data collection, efficiency improves by 10% over the CDC method for unstructured data in general. © 2015, Science Press. All rights reserved.
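The idea of object-level de-duplication can be illustrated with a minimal sketch. It is not the authors' extraction method or granularity algorithm, which the abstract does not detail. It only assumes that an OpenXML file is a ZIP package whose parts (XML documents, embedded media) can be treated as objects, and it fingerprints each part's uncompressed bytes with SHA-256. The function names `extract_objects`, `deduplicate`, and `make_package` are hypothetical and chosen for illustration.

```python
import hashlib
import io
import zipfile


def extract_objects(package_bytes):
    """Yield (part_name, content) for every part of a ZIP-based package.

    Assumption: each package part is treated as one de-duplication object.
    """
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for info in package.infolist():
            if not info.is_dir():
                # read() returns the uncompressed bytes of the part
                yield info.filename, package.read(info)


def deduplicate(packages):
    """Store each distinct object once, keyed by the SHA-256 of its content.

    `packages` maps a file name to the raw bytes of that package.
    Returns (store, recipes): `store` maps digest -> content, and `recipes`
    maps file name -> list of (part_name, digest) describing the file.
    """
    store = {}
    recipes = {}
    for file_name, package_bytes in packages.items():
        recipe = []
        for part_name, content in extract_objects(package_bytes):
            digest = hashlib.sha256(content).hexdigest()
            store.setdefault(digest, content)  # keep only the first copy
            recipe.append((part_name, digest))
        recipes[file_name] = recipe
    return store, recipes


def make_package(parts):
    """Build a small in-memory ZIP package from {part_name: bytes} (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, content in parts.items():
            package.writestr(name, content)
    return buffer.getvalue()
```

Because the fingerprint is computed per object rather than per byte range of the container, an image embedded in two packages under different part names or at different positions maps to the same digest, which is the layout-independence property the abstract describes.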