Think Before You Shuffle: Data-Driven Shuffles for Geo-Distributed Analytics

Cited by: 1
Authors
Goyal, Maruth [1]
Akella, Aditya [1]
Affiliations
[1] Univ Texas Austin, Austin, TX 78712 USA
Keywords
geo-distributed analytics; wide-area networks
DOI
10.1145/3530050.3532922
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Data is becoming increasingly geo-distributed due to the introduction of a variety of data-locality regulations [1, 2]. This introduces many new challenges for analytics systems. In this work we focus on the significantly increased cost of data movement over a Wide Area Network (WAN). The resulting network bottleneck degrades the performance and raises the cost of classic shuffle-based distributed join algorithms. We address this problem by designing a novel data-driven shuffle execution protocol, which uses fine-grained statistics over subsets of the data to locally eliminate rows from the shuffle partitions. Our experiments in simulation demonstrate the benefit of a data-driven shuffle execution procedure over a variety of real and synthetic workloads.
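The abstract does not spell out which statistics the protocol collects or how they are exchanged. As a rough illustration of the general idea only, the sketch below assumes per-block min/max ranges of the join key (a zone-map style statistic) that one site sends to another before the shuffle; the receiving site then drops rows whose key falls in no remote range, since they cannot join. The names BlockStats, block_stats, may_match, and prune_for_shuffle, the block size, and the sample data are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockStats:
    """Min/max join key of one block of rows (assumed statistic, not the paper's)."""
    lo: int
    hi: int


def block_stats(keys: list[int], block_size: int) -> list[BlockStats]:
    """Summarize join keys in fixed-size blocks; each summary is two integers."""
    stats = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        stats.append(BlockStats(min(block), max(block)))
    return stats


def may_match(key: int, remote: list[BlockStats]) -> bool:
    """True if the key lies inside at least one remote block range."""
    return any(s.lo <= key <= s.hi for s in remote)


def prune_for_shuffle(rows: list[tuple[int, str]],
                      remote: list[BlockStats]) -> list[tuple[int, str]]:
    """Keep only rows whose join key (first field) could match a remote row."""
    return [row for row in rows if may_match(row[0], remote)]


# Site B summarizes its join keys and ships only the summaries over the WAN.
site_b_keys = [10, 11, 12, 13, 50, 51, 52, 53]
site_b_stats = block_stats(site_b_keys, block_size=4)

# Site A filters its rows locally before shuffling them for an inner join.
site_a_rows = [(5, "a"), (11, "b"), (30, "c"), (52, "d"), (53, "e"), (99, "f")]
to_shuffle = prune_for_shuffle(site_a_rows, site_b_stats)
print(to_shuffle)
```

Under these assumptions the filter is conservative: a row is dropped only when its key lies outside every remote range, so no row that would contribute to an inner join is lost, although some non-matching rows may still be shipped. Ranges prune well only when keys within a block are clustered; finer blocks prune more rows at the price of sending more statistics.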
Pages: 6
Related papers
50 in total
  • [1] Low Latency Geo-distributed Data Analytics
    Pu, Qifan
    Ananthanarayanan, Ganesh
    Bodik, Peter
    Kandula, Srikanth
    Akella, Aditya
    Bahl, Paramvir
    Stoica, Ion
    ACM SIGCOMM COMPUTER COMMUNICATION REVIEW, 2015, 45 (04) : 421 - 434
  • [2] Low Latency Geo-distributed Data Analytics
    Pu, Qifan
    Ananthanarayanan, Ganesh
    Bodik, Peter
    Kandula, Srikanth
    Akella, Aditya
    Bahl, Paramvir
    Stoica, Ion
    SIGCOMM'15: PROCEEDINGS OF THE 2015 ACM CONFERENCE ON SPECIAL INTEREST GROUP ON DATA COMMUNICATION, 2015, : 421 - 434
  • [3] WANalytics: Geo-Distributed Analytics for a Data Intensive World
    Vulimiri, Ashish
    Curino, Carlo
    Godfrey, P. Brighten
    Jungblut, Thomas
    Karanasos, Konstantinos
    Padhye, Jitu
    Varghese, George
    SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015, : 1087 - 1092
  • [4] Bohr: Similarity Aware Geo-Distributed Data Analytics
    Li, Hangyu
    Xu, Hong
    Nutanong, Sarana
    CONEXT'18: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON EMERGING NETWORKING EXPERIMENTS AND TECHNOLOGIES, 2018, : 267 - 279
  • [5] Optimal Query Plans for Geo-distributed Data Analytics at Scale
    Pradhan, Ahana
    Karthik, Srinivas
    Subramanya, Raghunandan
    PROCEEDINGS OF 7TH JOINT INTERNATIONAL CONFERENCE ON DATA SCIENCE AND MANAGEMENT OF DATA, CODS-COMAD 2024, 2024, : 247 - 251
  • [6] Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics
    Wolfrath, Joel
    Chandra, Abhishek
    PROCEEDINGS OF THE 2023 ACM SYMPOSIUM ON CLOUD COMPUTING, SOCC 2023, 2023, : 1 - 16
  • [7] Fast, scalable and geo-distributed PCA for big data analytics
    Adnan, T. M. Tariq
    Tanjim, Md Mehrab
    Adnan, Muhammad Abdullah
    INFORMATION SYSTEMS, 2021, 98 (98)
  • [8] DAG-Aware Optimization for Geo-Distributed Data Analytics
    Wang, Qingyuan
    Gao, Bin
    Zhou, Zhi
    Xu, Fei
    Chenghao, Ouyang
    PROCEEDINGS OF THE 52ND INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2023, 2023, : 472 - 481
  • [9] A Network Cost-aware Geo-distributed Data Analytics System
    Oh, Kwangsung
    Chandra, Abhishek
    Weissman, Jon
    2020 20TH IEEE/ACM INTERNATIONAL SYMPOSIUM ON CLUSTER, CLOUD AND INTERNET COMPUTING (CCGRID 2020), 2020, : 649 - 658
  • [10] Delay-Resistant Geo-Distributed Analytics
    Mostafaei, Habib
    Smaragdakis, Georgios
    Zinner, Thomas
    Feldmann, Anja
    IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2022, 19 (04) : 4734 - 4749