Workload Interference Prevention with Intelligent Routing and Flexible Job Placement on Dragonfly

Authors
Kang, Yao [1]
Wang, Xin [1]
Lan, Zhiling [1]
Affiliations
[1] IIT, Chicago, IL 60616 USA
Funding
U.S. National Science Foundation
Keywords
high performance computing; interconnect networking; parallel discrete event simulation
DOI
10.1145/3573900.3591119
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Dragonfly is an indispensable interconnect topology for exascale HPC systems. To link tens of thousands of compute nodes at a reasonable cost, Dragonfly shares network resources across the entire system, so network bandwidth is not exclusive to any single job. Since HPC systems are usually shared among multiple co-running workloads, network competition between co-existing workloads is inevitable. This network contention appears as workload interference, where a job's network communication can be severely delayed by other jobs. Recent studies show that, compared with the deployed adaptive routing algorithms, Q-adaptive routing, an intelligent routing solution based on reinforcement learning, can reduce workload interference. In addition to improving routing efficiency, job placement is a simple yet effective method to mitigate workload interference. In this study, we leverage the well-known parallel discrete event simulation toolkit, SST, to investigate workload interference on Dragonfly with three contributions. We first develop an automatic module that serves as the bridge between SST and the HPC job scheduler for automatic simulation configuration and launching. Next, we propose a flexible job placement strategy that can mitigate workload interference based on workload communication characteristics. Finally, we extensively examine workload interference under various job placement and routing configurations.
Pages: 23-33
Page count: 11