Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors

Cited by: 0
Authors
Mannino, Mirco [1]
Peccerillo, Biagio [1]
Mondelli, Andrea [2]
Bartolini, Sandro [1]
Affiliations
[1] Univ Siena, Dept Informat Engn & Math, I-53100 Siena, Italy
[2] Huawei Technol Co Ltd, Cambridge CB4 0WG, England
Keywords
Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation
DOI
10.1109/ACCESS.2023.3283312
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Convolutional neural networks are among the most widely used types of deep learning networks thanks to their usefulness in many application domains, and considerable effort goes into improving their training and inference performance and efficiency. One of the most widely used techniques for implementing convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this approach spends extra time and memory transforming and storing the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over the tensor dimensions, and it does not require extra memory. In this work, we evaluate on various multi-core CPUs the performance and scalability effects of different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation suited to the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a single static implementation used for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67x speedup), matrix-matrix multiplication-based convolution on a multi-core system.
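To make the "series of nested loops" description concrete, the following is a minimal, unoptimized sketch of direct convolution in plain Python. It is an illustration only and not the authors' implementation: the function name `direct_conv2d`, the tensor layouts (input as C x H x W, weights as K x C x R x S), the loop order, and the choice of stride 1 with no padding are all assumptions made for this example. As is conventional for CNN layers, the kernel is not flipped.

```python
def direct_conv2d(inp, weights):
    """Direct convolution sketch (hypothetical helper, stride 1, no padding).

    inp:     nested lists shaped C x H x W (assumed layout)
    weights: nested lists shaped K x C x R x S (assumed layout)
    returns: nested lists shaped K x (H - R + 1) x (W - S + 1)
    """
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    out_h, out_w = H - R + 1, W - S + 1

    # The output is the only tensor allocated; no flattened copy of the
    # input is built, unlike the matrix-multiplication-based approach.
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in range(K)]

    for k in range(K):                  # output channels
        for y in range(out_h):          # output rows
            for x in range(out_w):      # output columns
                acc = 0.0
                for c in range(C):          # input channels
                    for r in range(R):      # kernel rows
                        for s in range(S):  # kernel columns
                            acc += inp[c][y + r][x + s] * weights[k][c][r][s]
                out[k][y][x] = acc
    return out
```

The six loops can be reordered, tiled, parallelized, or vectorized without changing the result, which is the design space the abstract refers to when it mentions loop organizations and parallelization strategies; this sketch fixes one arbitrary ordering.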
Pages: 57514-57528
Page count: 15
Related papers
50 in total
  • [1] Core Interface Optimization for Multi-core Neuromorphic Processors
    Su, Zhe
    Hwang, Hyunjung
    Torchet, Tristan
    Indiveri, Giacomo
    2023 28TH IEEE INTERNATIONAL SYMPOSIUM ON ASYNCHRONOUS CIRCUITS AND SYSTEMS, ASYNC, 2023, : 89 - 98
  • [2] Efficient Parallel Execution of Streaming Applications on Multi-Core Processors
    Schuele, Tobias
    PROCEEDINGS OF THE 19TH INTERNATIONAL EUROMICRO CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING, 2011, : 231 - 238
  • [3] Concept of a Computerized Numerical Control Kernel for Execution on Multi-core Processors
    Keinert, Matthias
    Lechler, Armin
    Verl, Alexander
    2016 IEEE 14TH INTERNATIONAL WORKSHOP ON ADVANCED MOTION CONTROL (AMC), 2016, : 581 - 586
  • [4] Parallel Optimization of Frequent Algorithm on Multi-core Processors
    Zhang, Yu
    Zhang, Jianzhong
    Xu, Jingdong
    Wu, Ying
    2012 INTERNATIONAL CONFERENCE ON CONTROL ENGINEERING AND COMMUNICATION TECHNOLOGY (ICCECT 2012), 2012, : 295 - 299
  • [5] Multi-Core Server Processors Thermal Analysis
    Xu, Guoping
    PROCEEDINGS OF THE SIXTEENTH INTERSOCIETY CONFERENCE ON THERMAL AND THERMOMECHANICAL PHENOMENA IN ELECTRONIC SYSTEMS ITHERM 2017, 2017, : 416 - 421
  • [6] Parallel optimization of convolution algorithm on multi-core DSP
    Xu, Jinwei
    Wang, Qinglin
    Li, Yalin
    Jiang, Jingfei
    Gao, Lei
    Li, Rongchun
    Li, Dongsheng
    Guofang Keji Daxue Xuebao/Journal of National University of Defense Technology, 2024, 46 (01): : 103 - 112
  • [7] Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
    邓林
    窦勇
    Journal of Central South University of Technology, 2011, 18 (02) : 490 - 498
  • [8] Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
    Deng Lin
    Dou Yong
    JOURNAL OF CENTRAL SOUTH UNIVERSITY OF TECHNOLOGY, 2011, 18 (02): : 490 - 498
  • [9] Multi-core optimization for conjugate gradient benchmark on heterogeneous processors
    Lin Deng
    Yong Dou
    Journal of Central South University, 2011, 18 : 490 - 498
  • [10] Analysis of Dynamic Power Management on Multi-Core Processors
    Bircher, W. Lloyd
    John, Lizy K.
    ICS'08: PROCEEDINGS OF THE 2008 ACM INTERNATIONAL CONFERENCE ON SUPERCOMPUTING, 2008, : 327 - 338