Think Before You Shuffle: Data-Driven Shuffles for Geo-Distributed Analytics

Citations: 1
Authors
Goyal, Maruth [1]
Akella, Aditya [1]
Affiliation
[1] Univ Texas Austin, Austin, TX 78712 USA
Keywords
geo-distributed analytics; wide-area networks
DOI
10.1145/3530050.3532922
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data is becoming increasingly geo-distributed due to the introduction of a variety of data-locality regulations [1, 2]. This introduces many new challenges for analytics systems. In this work, we focus on the significantly increased cost of data movement over a wide-area network (WAN). The resulting network bottleneck degrades the performance and raises the cost of classic shuffle-based distributed join algorithms. We address this problem by designing a novel data-driven shuffle execution protocol, which uses fine-grained statistics over subsets of the data to locally eliminate rows from the shuffle partitions. Our simulation experiments demonstrate the benefit of a data-driven shuffle execution procedure over a variety of real and synthetic workloads.
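The abstract does not say which statistics the protocol collects or how sites exchange them, so the following is only an illustrative sketch of the general idea of pruning shuffle partitions locally from a compact summary, not the authors' protocol. It assumes integer join keys and a bitmap of occupied key buckets as the summary; the names key_summary, prune_for_shuffle, NUM_BUCKETS, and the row layout are all hypothetical.

```python
from typing import Iterable, List, Tuple

Row = Tuple[int, str]  # (join_key, payload); hypothetical row layout

NUM_BUCKETS = 64  # hypothetical summary size in bits


def key_summary(rows: Iterable[Row], num_buckets: int = NUM_BUCKETS) -> int:
    """Build a bitmap with one bit set for every bucket that holds a join key."""
    bitmap = 0
    for key, _payload in rows:
        bitmap |= 1 << (key % num_buckets)
    return bitmap


def prune_for_shuffle(
    rows: Iterable[Row], remote_summary: int, num_buckets: int = NUM_BUCKETS
) -> List[Row]:
    """Keep only rows whose key bucket is set in the other side's summary.

    A row in an unset bucket cannot have a join partner, so it is dropped
    locally instead of being shuffled. Bucket collisions may keep rows that
    have no partner, but a row that does have a partner is never dropped.
    """
    return [
        row for row in rows
        if (remote_summary >> (row[0] % num_buckets)) & 1
    ]


# Assumed setup: site A holds rows of one join input, site B rows of the other.
site_a_rows = [(1, "a"), (2, "b"), (70, "c"), (5, "d")]
site_b_rows = [(2, "x"), (6, "y")]

# Site B sends only its small summary; site A prunes before shuffling.
summary_b = key_summary(site_b_rows)
to_shuffle = prune_for_shuffle(site_a_rows, summary_b)
# to_shuffle == [(2, "b"), (70, "c")]; key 70 survives only because it
# collides with key 6 in bucket 6 (a harmless false positive).
```

In this sketch the summary costs a fixed 64 bits regardless of table size, while each pruned row saves a full row transfer. A larger bitmap would reduce false positives at the cost of a larger summary.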
Pages: 6