NUMA-aware parallel sparse LU factorization for SPICE-based circuit simulators on ARM multi-core processors

Cited by: 0

Authors:

Zhou, Junsheng [1]

Yang, Wangdong [1]

Dong, Fengkun [1]

Lin, Shengle [1]

Cai, Qinyun [1]

Li, Kenli [1]

Affiliations:

[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China

Source:

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS | 2024

Funding:

National Natural Science Foundation of China; National Key Research and Development Program of China

Keywords:

High-performance computing; circuit simulation; parallel sparse lower-upper factorization; non-uniform memory access; PERFORMANCE; ALGORITHM;

DOI:

10.1177/10943420241241491

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology]

Discipline classification code:

0812

Abstract:

In circuit simulators that resemble the Simulation Program with Integrated Circuit Emphasis (SPICE), one of the most crucial steps is the solution of numerous sparse linear equations generated by frequency domain analysis or time domain analysis. The sparse direct solvers based on lower-upper (LU) factorization are extremely time-consuming, so their performance has become a significant bottleneck. Although some parallel sparse direct solvers exist for circuit simulation problems, they struggle to deliver performance and scalability on rapidly evolving parallel computers with multiple NUMA nodes based on the ARM architecture. In this paper, we introduce a parallel sparse direct solver named HLU, which re-examines the performance of the parallel algorithm from the viewpoint of parallelism in pipeline mode and the computing efficiency of each task. To maximize task-level parallelism and further minimize thread waiting time, HLU devises a fine-grained scheduling method based on an elimination tree in pipeline mode, which uses a depth-first-search-like (DFS-like) traversal to iteratively search for parent tasks and then places dependent tasks in the same task queue. HLU also proposes two NUMA node affinity strategies: thread affinity optimization based on the NUMA node topology to guarantee computational load balancing, and data affinity optimization to enable effective memory placement when threads access data. The rationality and effectiveness of HLU are validated on matrices from the SuiteSparse Matrix Collection. Compared with KLU and NICSLU, the experimental results and analysis show that HLU attains speedups of up to 9.14x and 1.26x (geometric mean), respectively, on a Huawei Kunpeng 920 server.
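The abstract's scheduling idea (walk from a task toward its parents in the elimination tree and keep dependent tasks in one queue) can be illustrated with a minimal Python sketch. This is not HLU's implementation, which the abstract does not detail. The function name `chain_queues`, the parent-array encoding with `-1` for a root, and the round-robin choice of queue per leaf are all assumptions made for illustration.

```python
def chain_queues(parent, num_queues):
    """Group elimination-tree nodes into task queues along leaf-to-root chains.

    parent[j] is the parent of column j in the elimination tree, or -1 for a
    root (hypothetical encoding). Returns a list of queues, each a sorted list
    of column indices.
    """
    n = len(parent)

    # A node is a leaf if no other node names it as its parent.
    has_child = [False] * n
    for p in parent:
        if p != -1:
            has_child[p] = True
    leaves = [j for j in range(n) if not has_child[j]]

    queue_of = [-1] * n
    queues = [[] for _ in range(num_queues)]

    # Assumption: leaves are dealt out round-robin; a real scheduler would
    # likely weigh queues by estimated work.
    for i, leaf in enumerate(leaves):
        q = i % num_queues
        node = leaf
        # Climb toward the root, claiming every ancestor not yet assigned,
        # so a task and the parents that depend on it share one queue.
        while node != -1 and queue_of[node] == -1:
            queue_of[node] = q
            queues[q].append(node)
            node = parent[node]

    # In an elimination tree parent[j] > j, so ascending column order is a
    # valid execution order inside each queue.
    for q in queues:
        q.sort()
    return queues


if __name__ == "__main__":
    # Columns 0 and 1 feed column 2; columns 2 and 3 feed the root, column 4.
    example_parent = [2, 2, 4, 4, -1]
    for idx, q in enumerate(chain_queues(example_parent, num_queues=2)):
        print(f"queue {idx}: {q}")
```

In this sketch a thread working through one queue only waits when a task needs a child held by another queue, which is the kind of waiting time the abstract says the scheduling method aims to reduce.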

Pages: 19
