Reducing Replication Bandwidth for Distributed Document Databases

Cited by: 12
Authors
Xu, Lianghong [1]
Pavlo, Andrew [1]
Sengupta, Sudipta [2]
Li, Jin [2]
Ganger, Gregory R. [1]
Affiliations
[1] Carnegie Mellon Univ, Pittsburgh, PA 15213 USA
[2] Microsoft Res, Cambridge, England
Keywords
ALGORITHM;
DOI
10.1145/2806777.2806840
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
With the rise of large-scale, Web-based applications, users are increasingly adopting a new class of document-oriented database management systems (DBMSs) that allow for rapid prototyping while also achieving scalable performance. As in other distributed storage systems, replication is important for document DBMSs in order to guarantee availability. The network bandwidth required to keep replicas synchronized is expensive and is often a performance bottleneck. As such, there is a strong need to reduce the replication bandwidth, especially for geo-replication scenarios where wide-area network (WAN) bandwidth is limited. This paper presents a deduplication system called sDedup that reduces the amount of data transferred over the network for replicated document DBMSs. sDedup uses similarity-based deduplication to remove redundancy in replication data by delta encoding against similar documents selected from the entire database. It exploits key characteristics of document-oriented workloads, including small item sizes, temporal locality, and the incremental nature of document edits. Our experimental evaluation of sDedup with three real-world datasets shows that it is able to achieve up to 38x reduction in data sent over the network, significantly outperforming traditional chunk-based deduplication techniques while incurring negligible performance overhead.
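The abstract's core idea (pick a similar document already in the database, then send only a delta against it) can be illustrated with a minimal Python sketch. This is not sDedup's implementation: the shingle-hash sketch, the `features`, `pick_source`, `delta_encode`, and `delta_decode` names, the parameters `k` and `shingle`, and the use of `difflib` for delta computation are all assumptions chosen for illustration.

```python
import hashlib
from difflib import SequenceMatcher


def features(text, k=8, shingle=8):
    """Similarity sketch (assumed scheme): the k smallest hashes of
    overlapping character shingles of the document."""
    hashes = {
        hashlib.md5(text[i:i + shingle].encode("utf-8")).hexdigest()
        for i in range(max(1, len(text) - shingle + 1))
    }
    return set(sorted(hashes)[:k])


def pick_source(target, candidates):
    """Return the candidate sharing the most sketch features with target,
    or None when no candidate shares any feature."""
    target_features = features(target)
    best, best_score = None, 0
    for doc in candidates:
        score = len(target_features & features(doc))
        if score > best_score:
            best, best_score = doc, score
    return best


def delta_encode(source, target):
    """Encode target as copy ranges from source plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, source, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # reuse bytes the replica already has
        elif j2 > j1:                             # "replace" or "insert"
            ops.append(("insert", target[j1:j2])) # only these literals cross the network
    return ops


def delta_decode(source, ops):
    """Rebuild the target on the replica from the source and the delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(source[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


# Hypothetical documents: an earlier revision, an unrelated record, and an edit.
old_rev = "Replication keeps copies of each document on several nodes for availability."
unrelated = "0123456789" * 5
new_rev = old_rev + " Edits are usually small."

source = pick_source(new_rev, [unrelated, old_rev])
delta = delta_encode(source, new_rev)
literal_chars = sum(len(op[1]) for op in delta if op[0] == "insert")
restored = delta_decode(source, delta)
```

In this toy setup only the literal inserts would need to be transferred, which is why incremental edits to small documents suit the approach; a real system would also need an index over features rather than a linear scan of candidates.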
Pages: 222-235
Page count: 14