Performance Prediction for Large-Scale Parallel Applications Using Representative Replay

Cited by: 12
Authors
Zhai, Jidong [1]
Chen, Wenguang [1]
Zheng, Weimin [1]
Li, Keqin [2]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing 100084, Peoples R China
[2] SUNY Coll New Paltz, Dept Comp Sci, New Paltz, NY 12561 USA
Keywords
Deterministic replay; high performance computing; MPI; parallel applications; performance prediction; trace-driven simulation; MODEL;
DOI
10.1109/TC.2015.2479630
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Automatically predicting the performance of parallel applications has been a long-standing goal in high performance computing. However, accurate performance prediction is challenging, since the execution time of a parallel application is determined by several factors, such as sequential computation time, communication time, and their complex interactions. Despite previous efforts, accurately estimating the sequential computation time of each process in large-scale parallel applications remains an open problem. In this paper, we propose a novel approach to acquiring accurate sequential computation time using a parallel debugging technique called deterministic replay. The main advantage of our approach is that it requires only a single node of the target platform; the whole target platform does not need to be available. With this approach, we can therefore measure the real sequential computation time on a target node for each process, one by one. Moreover, we observe that there is great computation similarity in parallel applications, not only within each process but also among different processes. Based on this observation, we further propose representative replay, which significantly reduces replay overhead because we only need to replay some of the iterations of representative processes instead of all of them. Finally, we implement a complete performance prediction system, called PHANTOM, which combines the above computation-time acquisition approach with a trace-driven simulator. We validate our approach on both traditional HPC platforms and the latest Amazon EC2 cloud platform. On both types of platforms, the prediction error of our approach is less than 7 percent on average with up to 2,500 processes.
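The representative-replay idea in the abstract can be illustrated with a minimal sketch. This is not PHANTOM's implementation: the per-process "signature" used to decide similarity, the `measure` callback standing in for a replay on one target node, and all function names are hypothetical assumptions made for illustration only. The sketch groups processes with the same signature, measures one representative per group, and reuses that time for the other members.

```python
from collections import defaultdict


def pick_representatives(signatures):
    """Group process ranks by signature.

    Returns {representative_rank: [all ranks in that group]}, using the
    lowest rank of each group as its representative (an arbitrary choice).
    """
    groups = defaultdict(list)
    for rank, sig in enumerate(signatures):
        groups[sig].append(rank)
    return {members[0]: members for members in groups.values()}


def estimate_compute_times(signatures, measure):
    """Measure only representatives and reuse the result for similar ranks.

    `measure(rank)` is a hypothetical stand-in for replaying one process
    on a single target node and timing its sequential computation.
    """
    times = [None] * len(signatures)
    for rep, members in pick_representatives(signatures).items():
        t = measure(rep)
        for rank in members:
            times[rank] = t
    return times


# Made-up example: four processes, two kinds of computation behaviour.
signatures = ["boundary", "interior", "interior", "boundary"]
measured = []  # records which ranks were actually "replayed"


def fake_measure(rank):
    measured.append(rank)
    return {"boundary": 1.5, "interior": 2.0}[signatures[rank]]


times = estimate_compute_times(signatures, fake_measure)
```

In this toy run only two of the four processes are measured. In a full prediction flow, the per-process times would then be fed to a trace-driven simulator together with the communication trace, as the abstract describes.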
Pages: 2184 - 2198
Number of pages: 15
Related papers
50 in total
  • [1] Continuous Performance Monitoring for Large-Scale Parallel Applications
    Dooley, Isaac
    Lee, Chee Wai
    Kale, Laxmikant V.
    [J]. 16TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), PROCEEDINGS, 2009, : 445 - 452
  • [2] PHANTOM: Predicting Performance of Parallel Applications on Large-Scale Parallel Machines Using a Single Node
    Zhai, Jidong
    Chen, Wenguang
    Zheng, Weimin
    [J]. ACM SIGPLAN NOTICES, 2010, 45 (05) : 305 - 314
  • [3] PHANTOM: Predicting Performance of Parallel Applications on Large-Scale Parallel Machines Using a Single Node
    Zhai, Jidong
    Chen, Wenguang
    Zheng, Weimin
    [J]. PPOPP 2010: PROCEEDINGS OF THE 2010 ACM SIGPLAN SYMPOSIUM ON PRINCIPLES AND PRACTICE OF PARALLEL PROGRAMMING, 2010, : 305 - 314
  • [4] Performance prediction of large parallel applications using parallel simulations
    Bagrodia, R
    Deelman, E
    Docy, S
    Phan, T
    [J]. ACM SIGPLAN NOTICES, 1999, 34 (08) : 151 - 162
  • [5] Parallel simulation of large-scale parallel applications
    Bagrodia, R
    Deelman, E
    Phan, T
    [J]. INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2001, 15 (01) : 3 - 12
  • [6] Probabilistic Diagnosis of Performance Faults in Large-Scale Parallel Applications
    Laguna, Ignacio
    Ahn, Dong H.
    de Supinski, Bronis R.
    Bagchi, Saurabh
    Gamblin, Todd
    [J]. PROCEEDINGS OF THE 21ST INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT'12), 2012, : 213 - 222
  • [7] Graph-Centric Performance Analysis for Large-Scale Parallel Applications
    Jin, Yuyang
    Wang, Haojie
    Zhong, Runxin
    Zhang, Chen
    Liao, Xia
    Zhang, Feng
    Zhai, Jidong
    [J]. IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2024, 35 (07) : 1221 - 1238
  • [9] Using MPI File Caching to Improve Parallel Write Performance for Large-Scale Scientific Applications
    Liao, Wei-keng
    Ching, Avery
    Coloma, Kenin
    Nisar, Arifa
    Choudhary, Alok
    Chen, Jacqueline
    Sankaran, Ramanan
    Klasky, Scott
    [J]. 2007 ACM/IEEE SC07 CONFERENCE, 2010, : 661 - +
  • [10] Performance prediction of large-scale parallel discrete event models of physical systems
    Perumalla, KS
    Fujimoto, RM
    Thakare, PJ
    Pande, S
    Karimabadi, H
    Omelchenko, Y
    Driscoll, J
    [J]. Proceedings of the 2005 Winter Simulation Conference, Vols 1-4, 2005, : 356 - 364