Plexus: Optimizing Join Approximation for Geo-Distributed Data Analytics

Cited by: 0
Authors
Wolfrath, Joel [1]
Chandra, Abhishek [1]
Affiliations
[1] Univ Minnesota, Minneapolis, MN 55417 USA
Keywords
Join Algorithms; Distributed Systems; Query Optimization; Wide Area Network; SAMPLES; BIG
DOI
10.1145/3620678.3624643
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern applications are increasingly generating and persisting data across geo-distributed data centers or edge clusters rather than a single cloud. This paradigm introduces challenges for traditional query execution due to increased latency when transferring data over wide-area network links. Join queries in particular are heavily affected, due to their large output size and the amount of data that must be shuffled over the network. Join sampling (computing a uniform sample from the join results) is a useful technique for reducing resource requirements. However, applying it to a geo-distributed setting is challenging, since acquiring independent samples from each location and joining on the samples does not produce uniform and independent tuples from the join result. To address these challenges, we first generalize an existing join sampling algorithm to the geo-distributed setting. We then present our system, Plexus, which introduces three additional optimizations to further reduce the network overhead and handle network and data heterogeneity: (i) weight approximation, (ii) heterogeneity awareness, and (iii) sample prefetching. We evaluate Plexus on a geo-distributed system deployed across multiple AWS regions, with an implementation based on Apache Spark. Using three real-world datasets, we show that Plexus can reduce query latency by up to 80% over the default Spark join implementation on a wide class of join queries without substantially impacting sample uniformity.
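The idea of join sampling mentioned in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the Plexus algorithm, and it is not necessarily the algorithm the paper generalizes; it is a generic weighted approach for a two-table equi-join, shown under the assumption that each table is a list of (key, payload) pairs. The function name `weighted_join_sample` and the tables `r` and `s` are hypothetical names introduced only for this illustration.

```python
import random
from collections import defaultdict


def weighted_join_sample(r, s, n, rng):
    """Draw n tuples uniformly, with replacement, from the equi-join of r and s.

    r and s are lists of (key, payload) pairs (an assumption of this sketch).
    """
    # Index s by join key so matching tuples can be looked up directly.
    s_by_key = defaultdict(list)
    for key, payload in s:
        s_by_key[key].append(payload)

    # The weight of an r tuple is the number of join results it contributes.
    weights = [len(s_by_key.get(key, ())) for key, _ in r]
    if sum(weights) == 0:
        return []  # The join is empty, so there is nothing to sample.

    # Pick r tuples proportionally to their weight, then pick one matching
    # s tuple uniformly. Each join result is then equally likely.
    picks = rng.choices(r, weights=weights, k=n)
    return [(key, a, rng.choice(s_by_key[key])) for key, a in picks]
```

Sampling each table independently and joining the samples would, by contrast, under-represent keys with many matches, which is the non-uniformity problem the abstract describes.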
Pages: 1 / 16
Number of pages: 16
Related papers
50 in total
  • [41] Optimizing Cost for Online Social Networks on Geo-Distributed Clouds
    Jiao, Lei
    Li, Jun
    Xu, Tianyin
    Du, Wei
    Fu, Xiaoming
    IEEE-ACM TRANSACTIONS ON NETWORKING, 2016, 24 (01) : 99 - 112
  • [42] Optimizing Concurrent Evacuation Transfers for Geo-Distributed Datacenters in SDN
    Li, Xiaole
    Wang, Hua
    Yi, Shanwen
    Yao, Xibo
    Zhu, Fangjin
    Zhai, Linbo
    ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2017, 2017, 10393 : 99 - 114
  • [43] Traffic-Aware Geo-Distributed Big Data Analytics with Predictable Job Completion Time
    Li, Peng
    Guo, Song
    Miyazaki, Toshiaki
    Liao, Xiaofei
    Jin, Hai
    Zomaya, Albert Y.
    Wang, Kun
    IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2017, 28 (06) : 1785 - 1796
  • [44] Green Computing with Geo-Distributed Heterogeneous Data Centers
    Pasricha, Sudeep
    Hogade, Ninad
    Siegel, Howard Jay
    Maciejewski, Anthony A.
    2019 TENTH INTERNATIONAL GREEN AND SUSTAINABLE COMPUTING CONFERENCE (IGSC), 2019,
  • [45] Yugong: Geo-Distributed Data and Job Placement at Scale
    Huang, Yuzhen
    Shi, Yingjie
    Zhong, Zheng
    Feng, Yihui
    Cheng, James
    Li, Jiwei
    Fang, Haochuan
    Li, Chao
    Guan, Tao
    Zhou, Jingren
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (12) : 2155 - 2169
  • [46] Investigation of Network Traffic in Geo-Distributed Data Centers
    Koshiba, Yutaka
    Chen, Wuhui
    Yamada, Yuichi
    Tanaka, Takazumi
    Paik, Incheon
    2015 IEEE 7TH INTERNATIONAL CONFERENCE ON AWARENESS SCIENCE & TECHNOLOGY (ICAST), 2015, : 174 - 179
  • [47] Fast Big Data Analysis in Geo-Distributed Cloud
    Li, Yue
    Zhao, Laiping
    Cui, Chenzhou
    Yu, Ce
    2016 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2016, : 388 - 391
  • [48] Fast media caching for geo-distributed data centers
    Zhang, Wei
    Wen, Yonggang
    Liu, Fang
    Chen, Yiqiang
    Fan, Rui
    COMPUTER COMMUNICATIONS, 2018, 120 : 46 - 57
  • [49] Holistic Management of Sustainable Geo-Distributed Data Centers
    Abbasi, Zahra
    Gupta, Sandeep K. S.
    2015 IEEE 22ND INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2015, : 426 - 435
  • [50] AggNet: Cost-Aware Aggregation Networks for Geo-distributed Streaming Analytics
    Kumar, Dhruv
    Ahmad, Sohaib
    Chandra, Abhishek
    Sitaraman, Ramesh K.
    2021 ACM/IEEE 6TH SYMPOSIUM ON EDGE COMPUTING (SEC 2021), 2021, : 297 - 311