DRLCAP: Runtime GPU Frequency Capping With Deep Reinforcement Learning

Citations: 0
Authors
Wang, Yiming [1 ]
Hao, Meng [1 ]
He, Hui [1 ]
Zhang, Weizhe [1 ]
Tang, Qiuyuan [2 ]
Sun, Xiaoyang [3 ]
Wang, Zheng [3 ]
Affiliations
[1] Harbin Inst Technol, Sch Cyberspace Sci, Harbin 150001, Heilongjiang, Peoples R China
[2] Bili Bili Technol Co Ltd, Shanghai 310240, Peoples R China
[3] Univ Leeds, Sch Comp, Leeds LS2 9JT, England
Funding
National Natural Science Foundation of China
Keywords
Graphics processing units; Computer architecture; Optimization; Power system management; Runtime; Kernel; Deep learning; Deep reinforcement learning; GPU power optimization; GPUs; power and energy optimization; POWER MANAGEMENT;
DOI
10.1109/TSUSC.2024.3362697
CLC Number
TP3 [Computing Technology, Computer Technology]
Discipline Code
0812
Abstract
Power and energy consumption are limiting factors in modern computing systems. As GPUs become mainstream computing devices, power management for GPUs grows increasingly important. Existing work focuses on kernel-level GPU power management, which suffers from poor portability due to architecture-specific considerations. We present DRLCap, a general runtime power management framework intended to support power management across various GPU architectures. It periodically monitors system-level information to dynamically detect program phase changes and to model workload and GPU system behavior. Freeing the framework from kernel-specific constraints enhances its adaptability and responsiveness. DRLCap leverages dynamic GPU frequency capping, the most widely used power knob, to control power consumption. It employs deep reinforcement learning (DRL) to adapt to changing program phases by automatically adjusting its power policy through online learning, aiming to reduce GPU power consumption without significantly compromising application performance. We evaluate DRLCap on three NVIDIA GPU architectures and one AMD GPU architecture. Experimental results show that DRLCap improves on prior GPU power optimization strategies by a large margin. On average, it reduces GPU energy consumption by 22% with less than 3% performance slowdown on NVIDIA GPUs. This translates to a 20% improvement in energy efficiency, measured by the energy-delay product (EDP), over NVIDIA's default GPU power management strategy. On the AMD GPU architecture, DRLCap saves 10% energy on average with a 4% performance loss, improving energy efficiency by 8%.
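The core idea in the abstract — learn, online, which frequency cap trades a little performance for a large power saving — can be illustrated with a minimal sketch. This is not the paper's method: it replaces the deep RL agent with tabular (bandit-style) Q-learning, uses a toy simulated GPU in place of real frequency-capping hardware knobs, and all candidate cap values, power/slowdown models, and reward weights below are illustrative assumptions.

```python
import random

class SimulatedGPU:
    """Toy stand-in for a GPU whose power and runtime depend on the cap.

    All numbers are illustrative, not taken from the paper.
    """
    CAPS_MHZ = [900, 1100, 1300, 1500]  # hypothetical candidate frequency caps

    def step(self, cap_idx):
        cap = self.CAPS_MHZ[cap_idx]
        ref = self.CAPS_MHZ[-1]               # uncapped reference frequency
        slowdown = ref / cap                  # lower cap -> longer runtime
        power = 100.0 * (cap / ref) ** 2      # lower cap -> much lower power
        energy = power * slowdown
        # Reward trades energy savings against performance loss, mirroring
        # DRLCap's goal of saving power without a large slowdown.
        return -energy - 50.0 * max(0.0, slowdown - 1.03)

def train(episodes=2000, eps=0.1, lr=0.2, seed=0):
    """Epsilon-greedy tabular Q-learning over a single (stateless) phase."""
    rng = random.Random(seed)
    gpu = SimulatedGPU()
    q = [0.0] * len(SimulatedGPU.CAPS_MHZ)
    for _ in range(episodes):
        if rng.random() < eps:
            a = rng.randrange(len(q))         # explore a random cap
        else:
            a = max(range(len(q)), key=q.__getitem__)  # exploit best cap
        r = gpu.step(a)
        q[a] += lr * (r - q[a])               # one-step value update
    return q

q = train()
best = SimulatedGPU.CAPS_MHZ[max(range(len(q)), key=q.__getitem__)]
print("learned cap:", best, "MHz")
```

The same EDP arithmetic the abstract reports can be checked by hand: a 22% energy saving with a 3% slowdown gives a normalized EDP of 0.78 × 1.03 ≈ 0.80, i.e., the roughly 20% EDP improvement quoted for the NVIDIA GPUs.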
Pages: 712-726
Page count: 15
Related Papers
50 in total
  • [1] Cooperative Distributed GPU Power Capping for Deep Learning Clusters
    Kang, Dong-Ki
    Ha, Yun-Gi
    Peng, Limei
    Youn, Chan-Hyun
    IEEE TRANSACTIONS ON INDUSTRIAL ELECTRONICS, 2022, 69 (07) : 7244 - 7254
  • [2] Deep reinforcement learning for building honeypots against runtime DoS attack
    Veluchamy, Selvakumar
    Kathavarayan, Ruba Soundar
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2022, 37 (07) : 3981 - 4007
  • [3] Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters
    Chen, Zhaoyun
    Luo, Lei
    Quan, Wei
    Wen, Mei
    Zhang, Chunyuan
    IEEE CONFERENCE ON COMPUTER COMMUNICATIONS WORKSHOPS (IEEE INFOCOM 2019 WKSHPS), 2019, : 1023 - 1024
  • [4] PowerCoord: Power capping coordination for multi-CPU/GPU servers using reinforcement learning
    Azimi, Reza
    Jing, Chao
    Reda, Sherief
    SUSTAINABLE COMPUTING-INFORMATICS & SYSTEMS, 2020, 28
  • [5] A Feedback, Runtime Technique for Scaling the Frequency in GPU Architectures
    Wang, Yue
    Ranganathan, Nagarajan
    2014 IEEE COMPUTER SOCIETY ANNUAL SYMPOSIUM ON VLSI (ISVLSI), 2014, : 431 - 436
  • [6] Storage Efficient and Dynamic Flexible Runtime Channel Pruning via Deep Reinforcement Learning
    Chen, Jianda
    Chen, Shangyu
    Pan, Sinno Jialin
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 33, NEURIPS 2020, 2020, 33
  • [7] RIFLING: A reinforcement learning-based GPU scheduler for deep learning research and development platforms
    Chen, Zhaoyun
    SOFTWARE-PRACTICE & EXPERIENCE, 2022, 52 (06): : 1319 - 1336
  • [8] Demystifying GPU UVM Cost with Deep Runtime and Workload Analysis
    Allen, Tyler
    Ge, Rong
    2021 IEEE 35TH INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS), 2021, : 141 - 150
  • [9] Runtime Safety Assurance Using Reinforcement Learning
    Lazarus, Christopher
    Lopez, James G.
    Kochenderfer, Mykel J.
    2020 AIAA/IEEE 39TH DIGITAL AVIONICS SYSTEMS CONFERENCE (DASC) PROCEEDINGS, 2020,
  • [10] Optimal Runtime Assurance via Reinforcement Learning
    Miller, Krishna
    Zeitler, Christopher K.
    Shen, William
    Hobbs, Kerianne
    Schierman, John
    Viswanathan, Mahesh
    Mitra, Sayan
    PROCEEDINGS 15TH ACM/IEEE INTERNATIONAL CONFERENCE ON CYBER-PHYSICAL SYSTEMS, ICCPS 2024, 2024, : 67 - 76