Taming data locality for task scheduling under memory constraint in runtime systems

Authors
Gonthier, Maxime [1]
Marchal, Loris [2]
Thibault, Samuel [3]
Affiliations
[1] Inria, LIP, ENS Lyon, LaBRI, 200 Ave Vieille Tour, F-33405 Talence, Nouvelle Aquita, France
[2] Univ Claude Bernard Lyon 1, LIP, ENS Lyon, Inria, CNRS, 46 Allee Italie, F-69007 Lyon, Auvergne Rhone, France
[3] Univ Bordeaux, LaBRI, Inria Bordeaux Sud Ouest, CNRS, 200 Ave Vieille Tour, F-33405 Talence, Nouvelle Aquita, France
Keywords
Memory-aware scheduling; Eviction policy; Tasks sharing data; GPUs; Runtime systems; Memory constraint; Algorithms
DOI
10.1016/j.future.2023.01.024
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler knows all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and show that our proposed heuristic has reasonable complexity. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task order, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand this performance, as well as lower bounds on the number of data loads for the 2D and 3D matrix products. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time. © 2023 Elsevier B.V. All rights reserved.
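To make the setting of the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how the number of data loads depends on the task order when tasks share input data and memory is limited. It assumes unit-sized data and a memory that holds `capacity` data items, and it uses a furthest-next-use eviction rule (in the style of Belady's rule) as a stand-in for an optimal eviction policy. The names `count_loads`, `inputs`, `row_major`, and `snake` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, FrozenSet, Sequence


def count_loads(order: Sequence[str],
                inputs: Dict[str, FrozenSet[str]],
                capacity: int) -> int:
    """Count data loads when tasks run in `order` on a memory of `capacity` unit-sized slots.

    Assumption: when memory is full, evict the resident data item whose next
    use lies furthest in the future (never-used-again items go first).
    """

    def next_use(datum: str, after: int) -> int:
        # Index of the next task (after position `after`) that reads `datum`,
        # or len(order) if it is never read again.
        for j in range(after + 1, len(order)):
            if datum in inputs[order[j]]:
                return j
        return len(order)

    memory: set = set()
    loads = 0
    for i, task in enumerate(order):
        needed = inputs[task]
        if len(needed) > capacity:
            raise ValueError(f"task {task} does not fit in memory")
        for datum in sorted(needed - memory):
            if len(memory) >= capacity:
                # Never evict an input of the task about to run.
                candidates = sorted(memory - needed)
                victim = max(candidates, key=lambda d: next_use(d, i))
                memory.remove(victim)
            memory.add(datum)
            loads += 1
    return loads


# Toy 2x2 tiled product: task Tij reads block-row Ai and block-column Bj.
inputs = {
    "T00": frozenset({"A0", "B0"}),
    "T01": frozenset({"A0", "B1"}),
    "T10": frozenset({"A1", "B0"}),
    "T11": frozenset({"A1", "B1"}),
}
row_major = ["T00", "T01", "T10", "T11"]
snake = ["T00", "T01", "T11", "T10"]

print(count_loads(row_major, inputs, capacity=2))  # 6 loads
print(count_loads(snake, inputs, capacity=2))      # 5 loads
print(count_loads(row_major, inputs, capacity=3))  # 4 loads, one per data item
```

With only two memory slots, the snake order reuses one input between every pair of consecutive tasks and saves a load compared with the row-major order. With three slots, the eviction rule alone is enough to reach the trivial lower bound of one load per data item.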
Pages: 305-321
Number of pages: 17