Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

Cited by: 75
Authors
Lo, JL
Eggers, SJ
Emer, JS
Levy, HM
Stamm, RL
Tullsen, DM
Affiliations
[1] DIGITAL EQUIPMENT CORP, HUDSON, MA USA
[2] UNIV CALIF SAN DIEGO, DEPT COMP SCI & ENGN, LA JOLLA, CA 92093 USA
Source
ACM TRANSACTIONS ON COMPUTER SYSTEMS | 1997 / Vol. 15 / No. 3
Keywords
cache interference; instruction-level parallelism; multiprocessors; multithreading; simultaneous multithreading; thread-level parallelism;
DOI
10.1145/263326.263382
Chinese Library Classification
TP301 [Theory, Methods];
Discipline code
081202 ;
Abstract
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP. We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor. 
The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding additional thread contexts on an SMT (relative to adding additional processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware. We find that multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting thread-level parallelism, however, SMT hides these additional latencies, so that they only have a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.
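The abstract's core argument is that static partitioning (a fixed-width core per thread in an MP) wastes issue slots whenever a thread's ILP falls short of its core width, while SMT lets all threads compete for every slot each cycle. A minimal toy model of that contrast, with hypothetical per-cycle ILP values and widths chosen for illustration (not taken from the paper's simulations):

```python
# Toy issue-slot model: 4 threads, 8 total issue slots per cycle.
# MP statically partitions the slots (four 2-wide cores, one thread each);
# SMT dynamically shares all 8 slots among the threads.

# Hypothetical instructions-ready counts per thread, per cycle.
ilp_per_cycle = [
    [1, 3, 0, 4],
    [2, 2, 5, 1],
    [0, 1, 6, 2],
]

def mp_issued(ilp):
    # Each thread is capped by its own core's width (2).
    return sum(min(2, ready) for ready in ilp)

def smt_issued(ilp):
    # Threads pool their ready instructions, capped by total width (8).
    return min(8, sum(ilp))

mp_total = sum(mp_issued(cycle) for cycle in ilp_per_cycle)
smt_total = sum(smt_issued(cycle) for cycle in ilp_per_cycle)
print(mp_total, smt_total)  # → 17 24
```

With these numbers the MP issues 17 instructions in 3 cycles against SMT's 24: low-ILP threads leave their private slots idle under static partitioning, while SMT gives those slots to whichever thread has work, which is the interchangeability of ILP and TLP the abstract describes.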
Pages: 322 - 354
Page count: 33