A parametrized loop fusion algorithm for improving parallelism and cache locality

Cited by: 27
Authors
Singhai, SK [1]
McKinley, KS [1]
Affiliation
[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
Source
COMPUTER JOURNAL | 1997 / Vol. 40 / No. 06
DOI
10.1093/comjnl/40.6.340
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can increase data locality and the granularity of parallel loops, thus improving program performance. Previous approaches to this problem have looked at these two benefits in isolation. In this work, we propose a new model that considers data locality, parallelism and register pressure together. We build a weighted directed acyclic graph in which the nodes represent program loops along with their register pressure, and the edges represent the amount of locality and parallelism present. The direction of an edge represents an execution order constraint. We then partition the graph into components such that the sum of the weights on the cut edges is minimized, subject to two constraints: the nodes in the same partition can be safely fused together, and the register pressure of the combined loop does not exceed the number of available registers. Previous work demonstrates that the general problem of finding optimal partitions is NP-hard. In restricted cases, we show that it is possible to arrive at the optimal solution. We give an algorithm for the restricted case and a heuristic for the general case. We demonstrate the effectiveness of fusion and our approach with experimental results.
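The partitioning problem described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm or heuristic, which the abstract does not spell out; it is a simple greedy pass written only to make the constraints concrete. The function name `greedy_fusion`, the edge tuple layout, the `fusible` flag, and the choice to model the register pressure of a fused loop as the sum of its members' pressures are all assumptions made for illustration.

```python
def greedy_fusion(pressure, edges, max_regs):
    """Greedy sketch of fusion partitioning (illustrative, not the paper's method).

    pressure: {loop_name: register pressure}
    edges:    list of (src, dst, weight, fusible); src must execute before dst,
              weight is the benefit lost if the edge is cut, and fusible=False
              marks a pair that cannot legally share a fused loop (assumed flag).
    max_regs: number of available registers.
    Returns (groups, cut_weight).
    """
    group = {n: n for n in pressure}  # loop -> id of the group it belongs to

    def members(g):
        return [n for n in group if group[n] == g]

    def creates_cycle(a, b):
        # Merge groups a and b hypothetically, then look for a cycle in the
        # graph of groups; a cycle would violate the execution order.
        merged = {n: (a if g == b else g) for n, g in group.items()}
        succ = {}
        for s, d, _, _ in edges:
            if merged[s] != merged[d]:
                succ.setdefault(merged[s], set()).add(merged[d])
        state = {}  # 1 = on the DFS stack, 2 = finished

        def visit(u):
            state[u] = 1
            for v in succ.get(u, ()):
                if state.get(v) == 1 or (v not in state and visit(v)):
                    return True
            state[u] = 2
            return False

        return any(visit(g) for g in set(merged.values()) if g not in state)

    # Try to keep the heaviest edges uncut first.
    for s, d, _, _ in sorted(edges, key=lambda e: -e[2]):
        a, b = group[s], group[d]
        if a == b:
            continue
        both = set(members(a)) | set(members(b))
        # Assumption: pressure of a fused loop is the sum of its parts.
        if sum(pressure[n] for n in both) > max_regs:
            continue
        if any(not f for x, y, _, f in edges if x in both and y in both):
            continue
        if creates_cycle(a, b):
            continue
        for n in members(b):
            group[n] = a

    cut = sum(w for s, d, w, _ in edges if group[s] != group[d])
    parts = {}
    for n, g in group.items():
        parts.setdefault(g, []).append(n)
    return sorted(parts.values()), cut


# Hypothetical four-loop program with 8 available registers.
pressure = {"L1": 4, "L2": 3, "L3": 5, "L4": 2}
edges = [
    ("L1", "L2", 10, True),
    ("L2", "L3", 6, True),
    ("L1", "L3", 4, False),  # fusion-preventing pair
    ("L3", "L4", 3, True),
]
print(greedy_fusion(pressure, edges, max_regs=8))
# prints ([['L1', 'L2'], ['L3', 'L4']], 10)
```

In this made-up example the heaviest edge is kept inside a fused group, while the register limit forces the edges into `L3` to be cut. A greedy pass like this gives no optimality guarantee, which is consistent with the abstract's note that the general problem is NP-hard.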
Pages: 340-355
Number of pages: 16