ParaML: A Polyvalent Multicore Accelerator for Machine Learning

Cited by: 3
Authors
Zhou, Shengyuan [1 ,2 ]
Guo, Qi [1 ,3 ]
Du, Zidong [1 ,3 ]
Liu, Daofu [1 ,3 ]
Chen, Tianshi [1 ,3 ,4 ]
Li, Ling [5 ]
Liu, Shaoli [1 ,3 ]
Zhou, Jinhong [1 ,3 ]
Temam, Olivier [6 ]
Feng, Xiaobing [7 ]
Zhou, Xuehai [8 ]
Chen, Yunji [1 ,2 ,4 ]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Intelligent Processor Res Ctr, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Sch Comp Sci & Technol, Beijing 100049, Peoples R China
[3] Cambricon Technol Corp Ltd, Beijing 100191, Peoples R China
[4] CAS Ctr Excellence Brain Sci & Intelligence Techn, Beijing 100190, Peoples R China
[5] Chinese Acad Sci, Inst Software, Beijing 100190, Peoples R China
[6] Inria Saclay, F-91120 Palaiseau, France
[7] Chinese Acad Sci, Inst Comp Technol, State Key Lab Comp Architecture, Beijing 100190, Peoples R China
[8] Univ Sci & Technol China, Hefei 230026, Peoples R China
Funding
Beijing Natural Science Foundation;
Keywords
Neural networks; Machine learning; Testing; Support vector machines; Linear regression; Computers; Computer architecture; Accelerator; machine learning (ML) techniques; multicore accelerator;
DOI
10.1109/TCAD.2019.2927523
CLC number
TP3 [Computing technology; computer technology];
Subject classification code
0812;
Abstract
In recent years, machine learning (ML) techniques have proven to be powerful tools in a variety of emerging applications. Traditionally, ML workloads have run on general-purpose CPUs and GPUs, but the energy efficiency of these platforms is limited by the cost of their general-purpose flexibility. Hardware accelerators are an efficient alternative to CPUs/GPUs, yet most accommodate only a single ML technique (or technique family). Since different problems call for different ML techniques, such single-technique accelerators may deliver poor learning accuracy or even be inapplicable. In this paper, we present ParaML, a polyvalent accelerator architecture with multiple processing cores that accommodates ten representative ML techniques: k-means, k-nearest neighbors (k-NN), naive Bayes (NB), support vector machine (SVM), linear regression (LR), classification tree (CT), deep neural network (DNN), learning vector quantization (LVQ), Parzen window (PW), and principal component analysis (PCA). Building on a thorough analysis of the computational primitives and locality properties of these techniques, the single-core ParaML performs up to 1056 GOP/s (counting operations such as additions and multiplications) in an area of 3.51 mm² while consuming only 596 mW, as estimated with ICC and PrimeTime PX, respectively, on the post-synthesis netlist. Compared with the NVIDIA K20M GPU (28-nm process), the single-core ParaML (65-nm process) is 1.21x faster and reduces energy consumption by 137.93x. We also compare the single-core ParaML with other accelerators. Compared with PRINS, it achieves 72.09x and 2.57x energy benefits for k-NN and k-means, respectively, and speeds up each k-NN query by 44.76x. Compared with EIE, it achieves a 5.02x speedup and a 4.97x energy benefit with 11.62x less area when evaluated on a dense DNN. Compared with the TPU, it achieves 2.45x better power efficiency (5647 GOP/W versus 2300 GOP/W) with 321.36x less area. Relative to the single-core version, the 8-core ParaML further improves the speedup by up to 3.98x, with an area of 13.44 mm² and a power of 2036 mW.
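The abstract hinges on the observation that these ten techniques decompose into a small set of shared computational primitives. As a purely illustrative sketch (this is not the ParaML datapath, and every function name below is a hypothetical stand-in), the following Python shows how several of the listed techniques reduce to the same dot-product and squared-distance kernels, which is what makes a single polyvalent datapath plausible:

    import numpy as np

    # Hypothetical illustration only -- not the ParaML design.
    # Two primitives cover much of the listed workload:
    #   1) dot product (multiply-accumulate over two vectors)
    #   2) squared distance (multiply-accumulate over a difference)

    def dot(w, x):
        return float(np.dot(w, x))        # MAC chain over (w, x)

    def sq_dist(a, b):
        d = a - b
        return float(np.dot(d, d))        # MAC chain over (d, d)

    def linear_score(w, b, x):
        # LR prediction / linear-SVM decision value: one dot product plus a bias.
        return dot(w, x) + b

    def nearest(centers, x):
        # Core step of k-NN search, k-means assignment, and LVQ:
        # rank stored vectors by distance to the query.
        return int(np.argmin([sq_dist(c, x) for c in centers]))

    x = np.array([1.0, 2.0, 3.0])
    w, b = np.array([0.5, -0.25, 0.1]), 0.2
    centers = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 2.5]])
    print(linear_score(w, b, x))          # 0.5*1 - 0.25*2 + 0.1*3 + 0.2 = 0.5
    print(nearest(centers, x))            # 1 (the second center is closer)

In hardware terms, both kernels are multiply-accumulate chains over streamed operands, so one MAC-based datapath with suitable local buffering can, in principle, serve k-NN, k-means, LVQ, SVM, and LR alike.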
Pages: 1764 - 1777
Page count: 14
Related papers
50 items in total
  • [41] Design of Target Recognition System Based on Machine Learning Hardware Accelerator
    Li, Yu
    Yu, Fengyuan
    Cai, Qian
    Qian, Meiyu
    Liu, Pengfeng
    Guo, Junwen
    Yan, Huan
    Yuan, Kun
    Yu, Juan
    WIRELESS PERSONAL COMMUNICATIONS, 2018, 102 (02) : 1557 - 1571
  • [43] Fast DSE of reconfigurable accelerator systems via ensemble machine learning
    Lopes, Alba
    Pereira, Monica
    ANALOG INTEGRATED CIRCUITS AND SIGNAL PROCESSING, 2021, 108 (03) : 495 - 509
  • [44] Memory-Centric Reconfigurable Accelerator for Classification and Machine Learning Applications
    Karam, Robert
    Paul, Somnath
    Puri, Ruchir
    Bhunia, Swarup
    ACM JOURNAL ON EMERGING TECHNOLOGIES IN COMPUTING SYSTEMS, 2017, 13 (03)
  • [45] Machine Learning-Based Energy Optimization for Parallel Program Execution on Multicore Chips
    Otoom, Mwaffaq
    Trancoso, Pedro
    Alzubaidi, Mohammad A.
    Almasaeid, Hisham
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2018, 43 (12) : 7343 - 7358
  • [46] Java Thread and Process Performance for Parallel Machine Learning on Multicore HPC Clusters
    Ekanayake, Saliya
    Kamburugamuve, Supun
    Wickramasinghe, Pulasthi
    Fox, Geoffrey C.
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 347 - 354
  • [48] Dynamic machine learning-based heuristic energy optimization approach on multicore architecture
    Sundaresan, Yokesh B.
    Durai, M. A. Saleem
    COMPUTATIONAL INTELLIGENCE, 2024, 40 (01)
  • [49] A Framework to Design and Implement Real-time Multicore Schedulers using Machine Learning
    Horstmann, Leonardo Passig
    Conradi Hoffmann, Jose Luis
    Frohlich, Antonio Augusto
    2019 24TH IEEE INTERNATIONAL CONFERENCE ON EMERGING TECHNOLOGIES AND FACTORY AUTOMATION (ETFA), 2019, : 251 - 258
  • [50] Scalable Suffix Sorting on a Multicore Machine
    Xie, Jing Yi
    Nong, Ge
    Lao, Bin
    Xu, Wentao
    IEEE TRANSACTIONS ON COMPUTERS, 2020, 69 (09) : 1364 - 1375