Optimizing Communication and Cooling Costs in HPC Data Centers via Intelligent Job Allocation

Authors
Kaplan, Fulya [1]
Meng, Jie [1]
Coskun, Ayse K. [1]
Affiliation
[1] Boston Univ, Elect & Comp Engn Dept, Boston, MA 02215 USA
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Nearly half of the energy in today's computing clusters is consumed by the cooling infrastructure. Cooling cost can be reduced by allowing data center temperatures to rise; however, component reliability constraints impose thermal thresholds, because failure rates depend exponentially on processor temperatures. Existing thermally-aware job allocation policies optimize cooling costs by minimizing the peak inlet temperatures of the server nodes. An important constraint in high performance computing (HPC) data centers, however, is performance. Specifically, HPC data centers run multi-threaded applications with significant communication among the threads, so the performance of such applications is strongly affected by job allocation decisions. This paper proposes a novel job allocation methodology that jointly minimizes the communication cost of an HPC application and reduces the cooling energy. The proposed method also considers temperature-dependent hardware reliability as part of the optimization.
Pages: 10
Related papers
33 in total
  • [1] Simulation and optimization of HPC job allocation for jointly reducing communication and cooling costs
    Meng, Jie
    McCauley, Samuel
    Kaplan, Fulya
    Leung, Vitus J.
    Coskun, Ayse K.
    [J]. SUSTAINABLE COMPUTING-INFORMATICS & SYSTEMS, 2015, 6 : 48 - 57
  • [2] Communication and cooling aware job allocation in data centers for communication-intensive workloads
    Meng, Jie
    Llamosi, Eduard
    Kaplan, Fulya
    Zhang, Chulian
    Sheng, Jiayi
    Herbordt, Martin
    Schirner, Gunar
    Coskun, Ayse K.
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2016, 96 : 181 - 193
  • [3] Method for optimizing equipment cooling effectiveness and HVAC cooling costs in telecom and data centers
    Herrlin, Magnus K.
    Khankari, Kishor
    [J]. ASHRAE TRANSACTIONS 2008, VOL 114, PT 1, 2008, 114 : 17 - +
  • [4] Cooling-Aware Job Scheduling and Node Allocation for Overprovisioned HPC Systems
    Cao, Thang
    Huang, Wei
    He, Yuan
    Kondo, Masaaki
    [J]. 2017 31ST IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2017 : 728 - 737
  • [5] Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions
    Hou, Zhengxiong
    Shen, Hong
    Zhou, Xingshe
    Gu, Jianhua
    Wang, Yunlan
    Zhao, Tianhai
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2022, 16 (05)
  • [6] Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions
    Zhengxiong Hou
    Hong Shen
    Xingshe Zhou
    Jianhua Gu
    Yunlan Wang
    Tianhai Zhao
    [J]. Frontiers of Computer Science, 2022, 16
  • [7] Prediction of job characteristics for intelligent resource allocation in HPC systems: a survey and future directions
    Zhengxiong HOU
    Hong SHEN
    Xingshe ZHOU
    Jianhua GU
    Yunlan WANG
    Tianhai ZHAO
    [J]. Frontiers of Computer Science, 2022, 16 (05) - 37
  • [8] Efficient Compute-Intensive Job Allocation in Data Centers via Deep Reinforcement Learning
    Yi, Deliang
    Zhou, Xin
    Wen, Yonggang
    Tan, Rui
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2020, 31 (06) : 1474 - 1485
  • [9] Integrating cooling awareness with thermal aware workload placement for HPC data centers
    Banerjee, Ayan
    Mukherjee, Tridib
    Varsamopoulos, Georgios
    Gupta, Sandeep K. S.
    [J]. SUSTAINABLE COMPUTING-INFORMATICS & SYSTEMS, 2011, 1 (02) : 134 - 150
  • [10] Optimizing Data Centre Energy Efficiency with Dynamic Resource Allocation and Intelligent Cooling Management through Machine Learning
    Radhakrishnan, Niranchana
    Vedhavathy, T. R.
    Bharathi, B. Marxim Rahula
    Chakaravarthi, S.
    Ramesh, T.
    [J]. JOURNAL OF ELECTRICAL SYSTEMS, 2024, 20 (03) : 286 - 293