Think Before You Shuffle: Data-Driven Shuffles for Geo-Distributed Analytics

Citations: 1
Authors
Goyal, Maruth [1]
Akella, Aditya [1]
Affiliation
[1] Univ Texas Austin, Austin, TX 78712 USA
Keywords
geo-distributed analytics; wide-area networks
DOI
10.1145/3530050.3532922
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data is becoming increasingly geo-distributed due to the introduction of a variety of data-locality regulations [1, 2]. This introduces many new challenges for analytics systems. In this work, we focus on the significantly increased cost of data movement over a Wide Area Network (WAN). The resulting network bottleneck degrades the performance and raises the cost of classic shuffle-based distributed join algorithms. We address this problem by designing a novel data-driven shuffle execution protocol, which uses fine-grained statistics over subsets of the data to locally eliminate rows from the shuffle partitions. Our simulation experiments demonstrate the benefit of a data-driven shuffle execution procedure over a variety of real and synthetic workloads.
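The abstract does not spell out the protocol, so the following is only a minimal illustrative sketch of the general idea of pruning rows locally with fine-grained statistics before a shuffle. It assumes the statistics are per-block (min, max) ranges of the join key, exchanged between sites ahead of the shuffle; the names `block_ranges`, `prune`, and `shuffle_partitions` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]          # (join_key, payload); an assumed row layout
KeyRange = Tuple[int, int]     # (min_key, max_key) for one block of rows


def block_ranges(rows: List[Row], block_size: int) -> List[KeyRange]:
    """Summarise each block of rows by the (min, max) of its join keys.

    These small summaries stand in for the "fine-grained statistics over
    subsets of the data"; the real statistics may differ (assumption).
    """
    ranges: List[KeyRange] = []
    for start in range(0, len(rows), block_size):
        keys = [key for key, _ in rows[start:start + block_size]]
        ranges.append((min(keys), max(keys)))
    return ranges


def prune(rows: List[Row], remote_ranges: List[KeyRange]) -> List[Row]:
    """Keep only rows whose key falls inside some remote block range.

    A row outside every remote range cannot find a join partner, so it is
    dropped locally and never crosses the WAN. Rows inside a range may
    still fail to match later; the filter is conservative, not exact.
    """
    return [
        row for row in rows
        if any(lo <= row[0] <= hi for lo, hi in remote_ranges)
    ]


def shuffle_partitions(rows: List[Row], num_partitions: int) -> Dict[int, List[Row]]:
    """Assign the surviving rows to shuffle partitions by key modulo."""
    parts: Dict[int, List[Row]] = {p: [] for p in range(num_partitions)}
    for row in rows:
        parts[row[0] % num_partitions].append(row)
    return parts
```

In this sketch, a site would call `block_ranges` on its own table, send the resulting ranges (which are much smaller than the rows) to the other site, and call `prune` with the ranges it receives before building its shuffle partitions. Smaller blocks give tighter ranges and more pruning at the price of more statistics to exchange.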
Pages: 6