Performance Models of Data Parallel DAG Workflows for Large Scale Data Analytics

Cited by: 1
Authors
Shi, Juwei [1]
Lu, Jiaheng [2]
Affiliations
[1] Microsoft Cooperat, STCA, Redmond, WA 98052 USA
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Keywords
MAPREDUCE; OPTIMIZATION
DOI
10.1109/ICDEW53142.2021.00026
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Directed Acyclic Graph (DAG) workflows are widely used for large-scale data analytics in cluster-based distributed computing systems. Building an accurate performance model for a DAG on data-parallel frameworks (e.g., MapReduce) is critical for implementing autonomic, self-managing big data systems. Building such a model is challenging because the allocation of preemptable system resources among parallel jobs may vary dynamically during execution, which makes it difficult to estimate execution time accurately. In this paper, we tackle this challenge by proposing a new cost model, called Bottleneck Oriented Estimation (BOE), which estimates the allocation of preemptable resources by identifying the bottleneck and uses that estimate to predict task execution time accurately. For a DAG workflow, we propose a state-based approach that iteratively uses the resource allocation property among stages to estimate the overall execution plan. Extensive experiments were performed to validate these cost models with HiBench and TPC-H workloads. The BOE model outperforms the state-of-the-art models by a factor of five for task execution time estimation.
Pages: 104 - 111
Number of pages: 8
Related papers
50 in total
  • [21] Recent Trends in Data Analytics for Upstream Process Workflows
    Pokhriyal, Prashant
    Gupta, Prateek
    Khambhampaty, Sridevi
    Ullanat, Rajesh
    Pathak, Mili
    BIOPHARM INTERNATIONAL, 2022, 35 (01) : 20 - 25
  • [22] Big Data Applications Using Workflows for Data Parallel Computing
    Wang, Jianwu
    Crawl, Daniel
    Altintas, Ilkay
    Li, Weizhong
    COMPUTING IN SCIENCE & ENGINEERING, 2014, 16 (04) : 11 - 21
  • [23] A Cloud Framework for Big Data Analytics Workflows on Azure
    Marozzo, Fabrizio
    Talia, Domenico
    Trunfio, Paolo
    CLOUD COMPUTING AND BIG DATA, 2013, 23 : 182 - 191
  • [24] Visual Analytics of Large-Scale Climate Model Data
    Wong, Pak Chung
    Shen, Han-Wei
    Leung, Ruby
    Hagos, Samson
    Lee, Teng-Yok
    Tong, Xin
    Lu, Kewei
    2014 IEEE 4TH SYMPOSIUM ON LARGE DATA ANALYSIS AND VISUALIZATION (LDAV), 2014, : 85 - 92
  • [25] Disco: A Computing Platform for Large-Scale Data Analytics
    Mundkur, Prashanth
    Tuulos, Ville
    Flatow, Jared
    ERLANG 11: PROCEEDINGS OF THE 2011 ACM SIGPLAN ERLANG WORKSHOP, 2011, : 84 - 89
  • [26] Context Aware Internet of Things for Large Scale Data Analytics
    Yengi, Yeliz
    Kucuk, Keren
    2017 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ENGINEERING (UBMK), 2017, : 702 - 706
  • [27] CityPulse: Large Scale Data Analytics Framework for Smart Cities
    Puiu, Dan
    Barnaghi, Payam
    Toenjes, Ralf
    Kuemper, Daniel
    Ali, Muhammad Intizar
    Mileo, Alessandra
    Parreira, Josiane Xavier
    Fischer, Marten
    Kolozali, Sefki
    Farajidavar, Nazli
    Gao, Feng
    Iggena, Thorben
    Pham, Thu-Le
    Nechifor, Cosmin-Septimiu
    Puschmann, Daniel
    Fernandes, Joao
    IEEE ACCESS, 2016, 4 : 1086 - 1108
  • [28] Anytime Large-Scale Analytics of Linked Open Data
    Soulet, Arnaud
    Suchanek, Fabian M.
    SEMANTIC WEB - ISWC 2019, PT I, 2019, 11778 : 576 - 592
  • [29] Scalable Data Analytics from Predevelopment to Large Scale Manufacturing
    Heimes, Heiner
    Kampker, Achim
    Buhrer, Ulrich
    Steinberger, Anita
    Eirich, Joscha
    Krotil, Stefan
    2019 ASIA PACIFIC CONFERENCE ON RESEARCH IN INDUSTRIAL AND SYSTEMS ENGINEERING (APCORISE), 2019, : 12 - 17
  • [30] Visual Cascade Analytics of Large-Scale Spatiotemporal Data
    Deng, Zikun
    Weng, Di
    Liang, Yuxuan
    Bao, Jie
    Zheng, Yu
    Schreck, Tobias
    Xu, Mingliang
    Wu, Yingcai
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2022, 28 (06) : 2486 - 2499