Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors

Authors
Mannino, Mirco [1]
Peccerillo, Biagio [1]
Mondelli, Andrea [2]
Bartolini, Sandro [1]
Affiliations
[1] Univ Siena, Dept Informat Engn & Math, I-53100 Siena, Italy
[2] Huawei Technol Co Ltd, Cambridge CB4 0WG, England
Keywords
Convolutional neural networks; direct convolution; multi-core; multi-threading; performance evaluation
DOI
10.1109/ACCESS.2023.3283312
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Convolutional neural networks are among the most widely used types of deep learning networks thanks to their usefulness in many application domains. Many efforts aim to increase their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this approach spends extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over tensor dimensions, and it does not require extra memory. In this work, we evaluate, on various multi-core CPUs, the performance and scalability effects of different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers, in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a static implementation for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67x speedup), matrix-matrix multiplication-based convolution in a multi-core system.
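To make the two approaches named in the abstract concrete, the following is a minimal, unoptimized Python sketch. It is not the authors' implementation: the function names `direct_conv2d` and `im2col_conv2d`, the `[C_in][H][W]` and `[C_out][C_in][KH][KW]` layouts, and the absence of padding are all assumptions made for illustration. The loop order shown is only one of the loop organizations the paper evaluates, and no parallelization or SIMD vectorization is attempted.

```python
def _out_shape(h, w, kh, kw, stride):
    # Output size for a convolution without padding (an assumption of this sketch).
    return (h - kh) // stride + 1, (w - kw) // stride + 1


def direct_conv2d(inp, weights, stride=1):
    """Direct convolution: nested loops over tensor dimensions, no extra buffer.

    inp:     [C_in][H][W]            (hypothetical layout)
    weights: [C_out][C_in][KH][KW]   (hypothetical layout)
    """
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, kh, kw = len(weights), len(weights[0][0]), len(weights[0][0][0])
    h_out, w_out = _out_shape(h, w, kh, kw, stride)
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for co in range(c_out):                  # output channels
        for oy in range(h_out):              # output rows
            for ox in range(w_out):          # output columns
                acc = 0.0
                for ci in range(c_in):       # input channels
                    for ky in range(kh):     # kernel rows
                        for kx in range(kw): # kernel columns
                            acc += (inp[ci][oy * stride + ky][ox * stride + kx]
                                    * weights[co][ci][ky][kx])
                out[co][oy][ox] = acc
    return out


def im2col_conv2d(inp, weights, stride=1):
    """Flattening-based convolution: build a 2D patch matrix, then matrix-multiply."""
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, kh, kw = len(weights), len(weights[0][0]), len(weights[0][0][0])
    h_out, w_out = _out_shape(h, w, kh, kw, stride)
    # Extra memory: one row per (ci, ky, kx), one column per output position.
    cols = [[inp[ci][oy * stride + ky][ox * stride + kx]
             for oy in range(h_out) for ox in range(w_out)]
            for ci in range(c_in) for ky in range(kh) for kx in range(kw)]
    # Each filter flattened into one row, in the same (ci, ky, kx) order.
    wmat = [[weights[co][ci][ky][kx]
             for ci in range(c_in) for ky in range(kh) for kx in range(kw)]
            for co in range(c_out)]
    # Plain matrix-matrix product; a tuned GEMM routine would replace this step.
    flat = [[sum(wrow[k] * cols[k][j] for k in range(len(cols)))
             for j in range(h_out * w_out)]
            for wrow in wmat]
    # Reshape each flat output row back to [H_out][W_out].
    return [[row[oy * w_out:(oy + 1) * w_out] for oy in range(h_out)]
            for row in flat]


# Tiny example: one 3x3 input channel, one 2x2 filter.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
k = [[[[1, 0], [0, 1]]]]
print(direct_conv2d(x, k))   # [[[6.0, 8.0], [12.0, 14.0]]]
print(im2col_conv2d(x, k) == direct_conv2d(x, k))  # True
```

In this sketch the `cols` matrix holds `KH * KW` copies of most input elements, which is the extra transformation time and storage the abstract refers to; the direct version reads the input tensor in place.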
Pages: 57514-57528
Number of pages: 15