CoLoc: Distributed Data and Container Colocation for Data-Intensive Applications

Cited by: 0
Authors
Renner, Thomas [1]
Thamsen, Lauritz [1]
Kao, Odej [1]
Affiliations
[1] Tech Univ Berlin, Berlin, Germany
Keywords
Resource Management; Data Placement; Parallel Dataflows; Scheduling; Data-Intensive Applications
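DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract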
The performance of scalable analytic frameworks supporting data-intensive parallel applications often depends significantly on the time it takes to read input data. Therefore, existing frameworks like Spark and Flink try to achieve a high degree of data locality by scheduling tasks on nodes where the input data resides. However, the set of nodes running a job and its tasks is chosen by a cluster resource management system like YARN, which schedules containers without taking the location of data into account. Yet the scheduling of the frameworks is restricted to the set of nodes the containers are running on. At the same time, many jobs in production clusters are recurring with predictable characteristics. For these jobs, it is possible to plan in advance on which nodes to place a job's input data and execution containers. In this paper we present CoLoc, a lightweight data and container scheduling assistant for recurring data-intensive analytic jobs. CoLoc allows users to define related files that serve as input for the same job. It colocates related files on a set of nodes and offers this scheduling hint to the cluster manager so that the job's containers are also placed on these nodes. The main advantage of CoLoc is a reduction of network transfers due to higher data locality and locally performed operators like grouping or joining two or more datasets. We implement CoLoc on Hadoop YARN and HDFS, then evaluate it on a 40-node cluster using workloads based on Apache Flink and the TPC-H benchmark suite. Compared to YARN's default scheduler and HDFS's block placement scheduler, CoLoc reduces the execution time by up to 35% for the tested data-intensive workloads.
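The colocation idea described in the abstract can be illustrated with a minimal sketch. It is not CoLoc's implementation and uses no HDFS or YARN API: the names `Node`, `choose_colocation_nodes`, and `plan_job` are hypothetical, and the selection policy (prefer the nodes with the most free space) is an assumption made only for this example. The sketch shows one shared node set being chosen for a group of related files and then reused as the container placement hint.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a cluster node: a name and its free storage."""
    name: str
    free_bytes: int


def choose_colocation_nodes(nodes, group_bytes, count):
    """Pick `count` nodes that can each hold an equal share of the file group.

    Assumed policy for illustration: prefer nodes with the most free space,
    breaking ties by name so the result is deterministic.
    """
    share = -(-group_bytes // count)  # ceiling division
    fitting = [n for n in nodes if n.free_bytes >= share]
    if len(fitting) < count:
        raise ValueError("not enough nodes with sufficient free space")
    fitting.sort(key=lambda n: (-n.free_bytes, n.name))
    return [n.name for n in fitting[:count]]


def plan_job(related_files, nodes, count):
    """Return (file placement, container hint) for one recurring job.

    `related_files` maps file paths to sizes in bytes. Every related file is
    assigned the same node set, and that node set doubles as the hint a
    cluster manager could use when placing the job's containers.
    """
    total = sum(related_files.values())
    targets = choose_colocation_nodes(nodes, total, count)
    placement = {path: list(targets) for path in related_files}
    return placement, targets


if __name__ == "__main__":
    cluster = [Node("a", 100), Node("b", 50), Node("c", 200), Node("d", 10)]
    files = {"/tpch/lineitem": 60, "/tpch/orders": 40}
    placement, hint = plan_job(files, cluster, count=2)
    print(placement)
    print(hint)
```

Because both inputs of a join land on the same nodes that later host the containers, the join can read its inputs locally, which is the source of the reduced network transfer that the abstract reports.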
Pages: 3008-3015
Page count: 8
Related papers
50 in total
  • [1] Citus: Distributed PostgreSQL for Data-Intensive Applications
    Cubukcu, Umur
    Erdogan, Ozgun
    Pathak, Sumedh
    Sannakkayala, Sudhakar
    Slot, Marco
    [J]. SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021: 2490-2502
  • [2] Understanding performance of distributed data-intensive applications
    Miceli, Christopher
    Miceli, Michael
    Rodriguez-Milla, Bety
    Jha, Shantenu
    [J]. PHILOSOPHICAL TRANSACTIONS OF THE ROYAL SOCIETY A-MATHEMATICAL PHYSICAL AND ENGINEERING SCIENCES, 2010, 368 (1926): 4089-4102
  • [3] Decoupling computation and data scheduling in distributed data-intensive applications
    Ranganathan, K
    Foster, I
    [J]. 11TH IEEE INTERNATIONAL SYMPOSIUM ON HIGH PERFORMANCE DISTRIBUTED COMPUTING, PROCEEDINGS, 2002: 352-358
  • [4] NSM: A distributed storage architecture for data-intensive applications
    Ali, Z
    Malluhi, Q
    [J]. 20TH IEEE/11TH NASA GODDARD CONFERENCE ON MASS STORAGE AND TECHNOLOGIES (MSST 2003), PROCEEDINGS, 2003: 87-91
  • [5] MapReduce Across Distributed Clusters for Data-intensive Applications
    Wang, Lizhe
    Tao, Jie
    Marten, Holger
    Streit, Achim
    Khan, Samee U.
    Kolodziej, Joanna
    Chen, Dan
    [J]. 2012 IEEE 26TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM WORKSHOPS & PHD FORUM (IPDPSW), 2012: 2004-2011
  • [6] Open active services for data-intensive distributed applications
    Collet, C
    Vargas-Solar, G
    Grazziotin-Ribeiro, H
    [J]. 2000 INTERNATIONAL DATABASE ENGINEERING AND APPLICATIONS SYMPOSIUM - PROCEEDINGS, 2000: 349-359
  • [7] Supporting Load Balancing For Distributed Data-Intensive Applications
    Glimcher, Leonid
    Ravi, Vignesh T.
    Agrawal, Gagan
    [J]. 16TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), PROCEEDINGS, 2009: 235-244
  • [8] Distributed Scientific Workflow Management for Data-Intensive Applications
    Shumilov, S.
    Leng, Y.
    El-Gayyar, M.
    Cremers, A. B.
    [J]. 12TH IEEE INTERNATIONAL WORKSHOP ON FUTURE TRENDS OF DISTRIBUTED COMPUTING SYSTEMS, PROCEEDINGS, 2008: 65-73
  • [9] Distributed data structure templates for data-intensive remote sensing applications
    Ma, Yan
    Wang, Lizhe
    Liu, Dingsheng
    Yuan, Tao
    Liu, Peng
    Zhang, Wanfeng
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2013, 25 (12): 1784-1797
  • [10] A distributed shared buffer space for data-intensive applications
    Lachaize, R
    Hansen, JS
    [J]. 2005 IEEE International Symposium on Cluster Computing and the Grid, Vols 1 and 2, 2005: 913-920