Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

Authors
Wolfrath, Joel [1]
Chandra, Abhishek [1]
Affiliations
[1] Univ Minnesota, Minneapolis, MN 55417 USA
Keywords
Join Algorithms; Distributed Systems; Query Optimization; Wide Area Network; SAMPLES; BIG
DOI
10.1145/3620678.3624643
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern applications increasingly generate and persist data across geo-distributed data centers or edge clusters rather than in a single cloud. This paradigm introduces challenges for traditional query execution because transferring data over wide-area network links increases latency. Join queries in particular are heavily affected, owing to their large output size and the amount of data that must be shuffled over the network. Join sampling, which computes a uniform sample from the join results, is a useful technique for reducing resource requirements. However, applying it in a geo-distributed setting is challenging, since acquiring independent samples from each location and joining the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness, and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
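The idea of join sampling can be illustrated with a minimal single-process sketch for a two-table equi-join. It uses a classic weighted scheme: each tuple of one table is drawn with probability proportional to the number of tuples it matches in the other table, and then one matching partner is chosen uniformly. This is an assumption made for illustration only; the abstract does not say which existing algorithm Plexus generalizes, and the function name `sample_join`, its parameters, and the toy tables are hypothetical.

```python
import random
from collections import defaultdict


def sample_join(r_rows, s_rows, key_r, key_s, n, rng):
    """Draw n tuples uniformly (with replacement) from R JOIN S
    without materializing the full join result.

    Illustrative sketch only, not the Plexus implementation.
    """
    # Index S by join key so matching tuples can be looked up directly.
    s_index = defaultdict(list)
    for s in s_rows:
        s_index[key_s(s)].append(s)

    # Weight of an R tuple = number of S tuples it joins with.
    weights = [len(s_index.get(key_r(r), ())) for r in r_rows]
    if sum(weights) == 0:
        return []  # the join is empty, so there is nothing to sample

    # Pick R tuples proportionally to their weights, then pick one
    # matching S tuple uniformly. Each join result (r, s) is drawn with
    # probability (w_r / W) * (1 / w_r) = 1 / W, i.e. uniformly.
    picked = rng.choices(r_rows, weights=weights, k=n)
    return [(r, rng.choice(s_index[key_r(r)])) for r in picked]


# Toy tables: the first field of each tuple is the join key.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [(1, "x"), (1, "y"), (1, "z"), (2, "w")]

samples = sample_join(R, S, lambda r: r[0], lambda s: s[0], 5, random.Random(0))
for r, s in samples:
    print(r, s)
```

In this toy input the join has four results, three of which involve the R tuple with key 1. Sampling R and S independently and joining the samples would not give each of those four results the same chance of appearing, which is the non-uniformity problem the abstract describes. In a geo-distributed deployment the weights would have to be computed across sites, which is where network cost enters.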
Pages: 1-16
Number of pages: 16