DAG-Aware Optimization for Geo-Distributed Data Analytics

Cited by: 0
Authors
Wang, Qingyuan [1]
Gao, Bin [1]
Zhou, Zhi [2]
Xu, Fei [3]
Chenghao, Ouyang [4]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
[2] Sun Yat Sen Univ, Guangzhou, Guangdong, Peoples R China
[3] East China Normal Univ, Shanghai, Peoples R China
[4] Shenzhen Inst Adv Technol, Shenzhen, Guangdong, Peoples R China
Funding
National Science Foundation (USA);
Keywords
geo-distributed; big data; scheduling; directed acyclic graph; COST; JOBS;
DOI
10.1145/3605573.3605575
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Geo-distributed data analytics has been proposed to analyze geographically distributed data. Existing studies have achieved significant reductions in the execution time and data transfer cost ($) of data analytics jobs by optimizing task placement. Given a directed acyclic graph (DAG)-style job, however, they mainly optimize each stage independently, and they tend to distribute tasks and intermediate data across all locations, potentially inflating the execution time and data transfer cost of descendant stages and of the whole job. In this paper, we propose a DAG-aware approach that minimizes job data transfer cost while guaranteeing job execution time. Specifically, we design a two-phase static/runtime algorithm that is both lightweight and adaptive to dynamics. The static phase estimates the optimal placement of all stages in the job, minimizing the job data transfer cost. Then, for each stage that is ready to execute, the runtime phase re-optimizes its task placement based on the static task placement of its child stages and on runtime information. It minimizes the stage data transfer cost while incorporating the stage execution time through a simple control knob. Overall, our approach aggregates early-stage tasks into fewer data centers, thereby reducing the data transfer cost and execution time of subsequent stages and of the whole job. We implement our approach in Spark and evaluate it across geo-distributed data centers. Our approach reduces application data transfer cost by up to 91% without increasing job execution time compared to existing baselines.
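The two-phase idea in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm or their Spark implementation: it assumes each stage runs entirely at one site, uses exhaustive search for the static phase, and models the control knob as a weight on an estimated stage time. All names (`PRICE`, `static_phase`, `runtime_phase`, `knob`) and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical egress prices ($/GB), charged by the sending site.
PRICE = {"A": 0.02, "B": 0.09, "C": 0.05}
SITES = sorted(PRICE)

# Hypothetical four-stage job: two scans feed a join, which feeds an aggregate.
PARENTS = {"scan1": [], "scan2": [], "join": ["scan1", "scan2"], "agg": ["join"]}
CHILDREN = {"scan1": ["join"], "scan2": ["join"], "join": ["agg"], "agg": []}
OUTPUT_GB = {"scan1": 10, "scan2": 4, "join": 2, "agg": 0.1}
PINNED = {"scan1": "A", "scan2": "B"}      # scans run where their input data lives
EST_TIME = {"A": 5, "B": 50, "C": 5}       # assumed runtime estimate (s); B is congested


def transfer_cost(src, dst, gigabytes):
    """Cost of moving data between sites; movement within a site is free."""
    return 0.0 if src == dst else PRICE[src] * gigabytes


def job_cost(placement, parents, output_gb):
    """Total cost of shipping every parent's output to its child's site."""
    return sum(
        transfer_cost(placement[p], placement[child], output_gb[p])
        for child, ps in parents.items()
        for p in ps
    )


def static_phase(parents, output_gb, pinned):
    """Search all placements of the unpinned stages (feasible for toy DAGs only)."""
    free = [s for s in parents if s not in pinned]
    best = None
    for choice in product(SITES, repeat=len(free)):
        placement = {**pinned, **dict(zip(free, choice))}
        cost = job_cost(placement, parents, output_gb)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best


def runtime_phase(stage, parents, children, output_gb, actual, planned, est_time, knob):
    """Re-place one ready stage from actual parent sites and planned child sites."""
    def score(site):
        inbound = sum(transfer_cost(actual[p], site, output_gb[p]) for p in parents[stage])
        outbound = sum(transfer_cost(site, planned[c], output_gb[stage]) for c in children[stage])
        # knob = 0 considers transfer cost only; larger values favor faster sites.
        return inbound + outbound + knob * est_time[site]
    return min(SITES, key=score)
```

Calling `static_phase(PARENTS, OUTPUT_GB, PINNED)` yields a whole-job plan, and `runtime_phase("join", PARENTS, CHILDREN, OUTPUT_GB, PINNED, plan, EST_TIME, knob)` then revisits a single stage once its parents have finished. In this sketch, raising `knob` shifts the choice away from the congested site even though that site is cheapest for data transfer.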
Pages: 472-481
Page count: 10