Enabling Highly Efficient k-Means Computations on the SW26010 Many-Core Processor of Sunway TaihuLight

Cited by: 8

Authors:

Li, Min ^{[1,2]}

Yang, Chao ^{[3,4,5]}

Sun, Qiao ^{[1]}

Ma, Wen-Jing ^{[1]}

Cao, Wen-Long ^{[1,2]}

Ao, Yu-Long ^{[3,4,5]}

Affiliations:

[1] Chinese Acad Sci, Inst Software, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China

[3] Peking Univ, Sch Math Sci, Beijing 100871, Peoples R China

[4] Peking Univ, Ctr Data Sci, Beijing 100871, Peoples R China

[5] Peng Cheng Lab, Shenzhen 518052, Peoples R China

Source:

JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY | 2019 / Vol. 34 / Issue 01

Funding:

Beijing Natural Science Foundation; National Natural Science Foundation of China;

Keywords:

parallel k-means; performance optimization; SW26010 processor; Sunway TaihuLight; ALGORITHM; PERFORMANCE;

DOI:

10.1007/s11390-019-1900-5

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812;

Abstract:

With the advent of the big data era, the volume of sampled data and the dimensionality of data features are growing rapidly. Fast and efficient clustering of unlabeled samples based on feature similarity is therefore highly desirable. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention. To achieve high-performance k-means computations on modern multi-core/many-core systems, we propose a matrix-based fused framework that achieves high performance by conducting computations on a distance matrix while improving memory reuse through the fusion of the distance-matrix computation and the nearest-centroid reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, which is the main source of computing power of Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint, and a register blocking strategy to increase data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Discussions on block-size tuning and performance modeling are also presented. Experiments on both randomly generated and real-world datasets show that our parallel implementation of k-means on SW26010 can sustain a double-precision performance of over 348.1 Gflops on a single core group, which is 46.9% of the peak performance and 84% of the theoretical performance upper bound, and can achieve nearly ideal scalability to the whole SW26010 processor of four core groups. Performance comparisons with the previous state of the art on both CPU and GPU are also provided to show the superiority of our optimized k-means kernel.
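The fusion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' SW26010 implementation: the function names `assign_fused` and `assign_naive`, the block sizes, and the use of the expansion ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2 are assumptions made only for illustration. The sketch computes the distance matrix one tile at a time and folds each tile straight into a running per-sample minimum, so the full samples-by-centroids matrix is never stored.

```python
def assign_fused(samples, centroids, block_n=4, block_k=2):
    """Return the index of the nearest centroid for each sample.

    Illustrative only: block_n and block_k are hypothetical tile sizes.
    Since ||x||^2 is the same for every centroid, only the score
    ||c||^2 - 2 * (x . c) needs to be compared.
    """
    n, k = len(samples), len(centroids)
    c_norm = [sum(v * v for v in c) for c in centroids]
    best_val = [float("inf")] * n
    best_idx = [-1] * n
    for i0 in range(0, n, block_n):          # tile over samples
        for j0 in range(0, k, block_k):      # tile over centroids
            for i in range(i0, min(i0 + block_n, n)):
                x = samples[i]
                for j in range(j0, min(j0 + block_k, k)):
                    # One entry of the (shifted) distance-matrix tile.
                    dot = sum(a * b for a, b in zip(x, centroids[j]))
                    score = c_norm[j] - 2.0 * dot
                    # Fused reduction: update the running minimum at once
                    # instead of writing the entry to a stored matrix.
                    if score < best_val[i]:
                        best_val[i] = score
                        best_idx[i] = j
    return best_idx


def assign_naive(samples, centroids):
    """Reference version: build every full distance row, then reduce it."""
    result = []
    for x in samples:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        result.append(dists.index(min(dists)))
    return result


if __name__ == "__main__":
    samples = [[0, 0], [1, 0], [10, 10], [9, 11], [0, 1]]
    centroids = [[0, 0], [10, 10], [5, 5]]
    print(assign_fused(samples, centroids))
    print(assign_naive(samples, centroids))
```

On a real many-core processor the inner tile would map to a blocked matrix-multiplication kernel; the pure-Python loops here only show the order of computation and where the reduction is fused.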


Pages: 77-93

Page count: 17
