Massive Atomics for Massive Parallelism on GPUs

Cited: 0
Authors
Egielski, Ian [1 ]
Huang, Jesse [1 ]
Zhang, Eddy Z. [1 ]
Affiliation
[1] Rutgers State Univ, Piscataway, NJ 08855 USA
Keywords
Performance; Management; GPU; Atomics; Parallelism; Concurrency;
DOI
10.1145/2775049.2602993
Chinese Library Classification
TP31 [Computer Software];
Subject Classification
081202; 0835;
Abstract
One important type of parallelism exploited in many applications is reduction-type parallelism. In these applications, the order of the read-modify-write updates to a shared data object can be arbitrary, as long as each read-modify-write update is performed atomically. The typical way to parallelize such applications is to first let every thread perform local computation and save the results in thread-private data objects, and then merge the results from all worker threads in a reduction stage. All applications that fit the map-reduce framework belong to this category; machine learning, data mining, numerical analysis, and scientific simulation applications may also benefit from reduction-type parallelism. However, parallelization via thread-private data objects may not be viable in massively parallel GPU applications. Because the number of concurrent threads is extremely large (at least tens of thousands), creating thread-private data objects may lead to memory space explosion. In this paper, we propose a novel approach to shared data object management for reduction-type parallelism on GPUs. Our approach exploits fine-grained parallelism while maintaining good programmability. It is based on intrinsic hardware atomic instructions. Atomic operations may appear expensive, since threads serialize when they atomically update the same memory object at the same time. However, we discovered that, with appropriate atomic-collision reduction techniques, an atomic implementation can outperform a non-atomic implementation, even on benchmarks known to have high-performance non-atomic GPU implementations. At the same time, the use of atomics greatly reduces coding complexity, since neither thread-private object management nor explicit thread communication (for the shared data objects protected by atomic operations) is necessary.
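The abstract contrasts thread-private privatization with direct atomic updates and mentions atomic-collision reduction techniques. As a concrete illustration of both ideas, here is a minimal CUDA sketch using a 256-bin histogram, a classic reduction-type workload; the kernel names and the histogram workload are illustrative assumptions, not benchmarks or code taken from the paper.

```cuda
#include <cuda_runtime.h>

#define NUM_BINS 256

// Naive version: every thread atomically updates the single shared
// histogram in global memory. Correct and simple to program, but
// threads hitting the same bin serialize on the atomic.
__global__ void hist_global_atomics(const unsigned char *in, int n,
                                    unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[in[i]], 1u);
}

// Collision-reduced version: each thread block first accumulates into
// a private copy of the histogram in fast shared memory, so atomic
// collisions are confined to one block; the block then merges its
// private copy into the global histogram with one atomic per bin.
__global__ void hist_privatized_atomics(const unsigned char *in, int n,
                                        unsigned int *bins) {
    __shared__ unsigned int local[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&local[in[i]], 1u);
    __syncthreads();

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}
```

The privatized kernel is one common instance of the collision-reduction idea: per-block shared-memory copies confine atomic contention to a single block, and the merge step costs only NUM_BINS global atomics per block. Note that neither version allocates a per-thread private histogram, which is how the atomics-based approach avoids the memory space explosion the abstract describes.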
Pages: 93-103
Page count: 11