Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly

Authors
Kang, Yao [1]
Wang, Xin [1]
Lan, Zhiling [1]
Affiliations
[1] IIT, Chicago, IL 60616 USA
Funding
National Science Foundation (USA)
Keywords
high performance computing; interconnect networking; parallel discrete event simulation
DOI
10.1145/3573900.3591119
Chinese Library Classification
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single job. Because HPC systems usually run multiple workloads concurrently, network competition between co-existing workloads is inevitable. This contention appears as workload interference, in which a job's network communication can be severely delayed by other jobs. Recent studies show that, compared with the deployed adaptive routing algorithms, Q-adaptive routing, an intelligent routing solution based on reinforcement learning, can reduce workload interference. Beyond improving routing efficiency, job placement is a simple yet effective way to mitigate workload interference. In this study, we use the well-known parallel discrete event simulation toolkit SST to investigate workload interference on Dragonfly, making three contributions. First, we develop a module that bridges SST and an HPC job scheduler to automate simulation configuration and launching. Next, we propose a flexible job placement strategy that mitigates workload interference based on workload communication characteristics. Finally, we extensively examine workload interference under various job placement and routing configurations.
Pages: 23-33
Number of pages: 11