NUMA-aware parallel sparse LU factorization for SPICE-based circuit simulators on ARM multi-core processors

Authors
Zhou, Junsheng [1]
Yang, Wangdong [1]
Dong, Fengkun [1]
Lin, Shengle [1]
Cai, Qinyun [1]
Li, Kenli [1]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha 410082, Peoples R China
Funding
National Natural Science Foundation of China; National Key Research and Development Program of China
Keywords
High-performance computing; circuit simulation; parallel sparse lower-upper factorization; non-uniform memory access; PERFORMANCE; ALGORITHM
DOI
10.1177/10943420241241491
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In circuit simulators modeled on the Simulation Program with Integrated Circuit Emphasis (SPICE), one of the most crucial steps is solving the many sparse linear systems generated by frequency-domain or time-domain analysis. Sparse direct solvers based on lower-upper (LU) factorization are extremely time-consuming, so their performance has become a significant bottleneck. Although some parallel sparse direct solvers exist for circuit simulation problems, they remain difficult to adapt, in terms of performance and scalability, to rapidly evolving parallel computers with multiple non-uniform memory access (NUMA) nodes based on the ARM architecture. In this paper, we introduce a parallel sparse direct solver named HLU, which re-examines the performance of the parallel algorithm from two viewpoints: parallelism in pipeline mode and the computing efficiency of each task. To maximize task-level parallelism and further minimize thread waiting time, HLU devises a fine-grained scheduling method based on an elimination tree in pipeline mode, which uses a depth-first-search-like (DFS-like) traversal to iteratively search for parent tasks and then places dependent tasks in the same task queue. HLU also proposes two NUMA node affinity strategies: thread affinity optimization based on the NUMA node topology to guarantee computational load balancing, and data affinity optimization to enable effective memory placement when threads access data. The rationality and effectiveness of HLU are validated on matrices from the SuiteSparse Matrix Collection. Experimental results and analysis show that, on a Huawei Kunpeng 920 server, HLU attains speedups of up to 9.14x and 1.26x (geometric mean) over KLU and NICSLU, respectively.
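The scheduling idea in the abstract, following parent links in the elimination tree so that dependent tasks land in the same queue, can be illustrated with a small sketch. This is not HLU's implementation, which the abstract does not specify. The function name `build_task_queues`, the parent-array encoding of the tree, and the round-robin assignment of leaves to queues are all assumptions made for illustration.

```python
def build_task_queues(parent, num_queues):
    """Group elimination-tree nodes into per-thread task queues.

    Hypothetical illustration, not the HLU algorithm itself.
    parent[j] is the parent of column j in the elimination tree, or -1
    for a root. In an elimination tree, parent[j] > j always holds.
    """
    n = len(parent)

    # A leaf is a column that no other column names as its parent.
    has_child = [False] * n
    for p in parent:
        if p >= 0:
            has_child[p] = True
    leaves = [j for j in range(n) if not has_child[j]]

    claimed = [False] * n
    queues = [[] for _ in range(num_queues)]
    for i, leaf in enumerate(leaves):
        queue = queues[i % num_queues]  # assumed round-robin placement
        node = leaf
        # Walk upward from the leaf, claiming every unclaimed ancestor so
        # that a chain of dependent tasks shares one queue.
        while node != -1 and not claimed[node]:
            claimed[node] = True
            queue.append(node)
            node = parent[node]

    # Because parent[j] > j, ascending column order is a valid dependency
    # order inside each queue.
    for queue in queues:
        queue.sort()
    return queues
```

For the tree `parent = [2, 2, 4, 4, -1]` and two queues, the chain 0, 2, 4 and the leaf 3 share the first queue while column 1 goes to the second. A thread working through the first queue would still have to wait for column 1 before factoring column 2, which is the kind of waiting time a pipeline-mode scheduler tries to minimize.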
Pages: 19