Think Before You Shuffle: Data-Driven Shuffles for Geo-Distributed Analytics

Cited by: 1
Authors
Goyal, Maruth [1]
Akella, Aditya [1]
Affiliation
[1] Univ Texas Austin, Austin, TX 78712 USA
Keywords
geo-distributed analytics; wide-area networks
DOI
10.1145/3530050.3532922
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data is becoming increasingly geo-distributed due to the introduction of a variety of data-locality regulations [1, 2]. This introduces many new challenges for analytics systems. In this work, we focus on the significantly increased cost of data movement over a Wide Area Network (WAN). The resulting network bottleneck degrades the performance and raises the cost of classic shuffle-based distributed join algorithms. We address this problem by designing a novel data-driven shuffle execution protocol, which uses fine-grained statistics over subsets of the data to locally eliminate rows from the shuffle partitions. Our simulation experiments demonstrate the benefit of a data-driven shuffle execution procedure over a variety of real and synthetic workloads.
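The abstract does not spell out the protocol itself, so the following is only a minimal sketch of the general idea of pruning rows locally before a shuffle. It assumes the "fine-grained statistic" is simply the set of distinct join keys held at the remote site, which is a stand-in chosen for illustration and not the authors' actual statistic. The function names (`key_summary`, `prune_for_shuffle`), the table names, and the `cust_id` column are all hypothetical.

```python
# Illustrative sketch only: a semi-join style filter, NOT the paper's protocol.
# Assumption: each site can cheaply send a compact summary of its join keys
# across the WAN before any rows are shuffled.

def key_summary(rows, key):
    """Summarize one data subset as the set of distinct join-key values."""
    return {row[key] for row in rows}


def prune_for_shuffle(rows, key, remote_summary):
    """Locally drop rows whose join key cannot match any row at the remote site."""
    return [row for row in rows if row[key] in remote_summary]


# Hypothetical data: orders live at site A, customers at site B.
orders_site_a = [
    {"order_id": 10, "cust_id": 1},
    {"order_id": 11, "cust_id": 2},
    {"order_id": 12, "cust_id": 3},
    {"order_id": 13, "cust_id": 4},
]
customers_site_b = [
    {"cust_id": 3, "name": "C"},
    {"cust_id": 4, "name": "D"},
    {"cust_id": 5, "name": "E"},
]

# Site B sends only its key summary over the WAN.
summary_b = key_summary(customers_site_b, "cust_id")

# Site A filters locally, so only rows that can join are shuffled.
pruned = prune_for_shuffle(orders_site_a, "cust_id", summary_b)

print(f"rows before pruning: {len(orders_site_a)}, after: {len(pruned)}")
```

In this toy case, two of the four order rows are dropped before they would cross the WAN. An exact key set only pays off when it is much smaller than the rows it eliminates; a real system would more likely use a compact approximate summary.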
Pages: 6