Highly Efficient Implementation of Block Ciphers on Graphic Processing Units for Massively Large Data

Cited by: 14

Authors:

An, SangWoo [1]

Seo, Seog Chung [1,2]

Affiliations:

[1] Kookmin Univ, Dept Financial Informat Secur, Seoul 02707, South Korea

[2] Kookmin Univ, Dept Informat Secur Cryptol & Math, Seoul 02707, South Korea

Source:

APPLIED SCIENCES-BASEL | 2020 / Vol. 10 / Issue 11

Funding:

National Research Foundation of Singapore;

Keywords:

AES; CHAM; LEA; Graphic Processing Unit (GPU); CUDA; Counter (CTR) mode; Parallel Processing;

DOI:

10.3390/app10113711

Chinese Library Classification (CLC) number:

O6 [Chemistry];

Discipline classification code:

0703 ;

Abstract:

With the advent of IoT and cloud computing services, the volume of user data to be managed and file data to be transmitted has increased significantly. To protect users' personal information, this data must be encrypted in a secure and efficient way. Because servers handling many clients or IoT devices have to encrypt large amounts of data in real time without compromising service capabilities, Graphic Processing Units (GPUs) have been considered a suitable candidate for a crypto accelerator in this setting. In this paper, we present highly efficient implementations of block ciphers on NVIDIA GPUs (specifically the Maxwell, Pascal, and Turing architectures) for environments using massively large data in IoT and cloud computing applications. As block cipher algorithms, we choose AES, a representative standard block cipher; LEA, which was recently added to the ISO/IEC 29192-2:2019 standard; and CHAM, a recently developed lightweight block cipher. To maximize parallelism in the encryption process, we use the Counter (CTR) mode of operation and customize it to exploit the GPU's characteristics. We apply several optimization techniques suited to the GPU architecture, such as kernel parallelism, memory optimization, and CUDA streams. Furthermore, we optimize each target cipher according to its algorithmic characteristics by implementing its core part with handcrafted inline PTX (Parallel Thread eXecution) code, the virtual assembly language of the CUDA platform. With these optimizations, on an RTX 2070 GPU, AES and LEA achieve up to 310 Gbps and 2.47 Tbps of throughput, respectively, which are 10.7% and 67% higher than the previous best results of 279.86 Gbps and 1.47 Tbps. For CHAM, this is the first optimized implementation on GPUs, and it achieves 3.03 Tbps of throughput on an RTX 2070 GPU.
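The abstract's reliance on CTR mode for parallelism can be illustrated with a minimal sketch. It is not the authors' GPU kernel: `toy_block_encrypt` is a hypothetical, insecure stand-in for the forward function of AES, LEA, or CHAM, and the function names, the 8-byte nonce, and the thread pool (standing in for GPU threads) are all assumptions made for illustration. The point it shows is that each keystream block depends only on the key, the nonce, and the block index, so blocks can be processed independently and in any order.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 16  # bytes per block (the AES block size; chosen here for illustration)


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for a block cipher's forward function. NOT secure."""
    return hashlib.sha256(key + block).digest()[:BLOCK_SIZE]


def ctr_xor_block(key: bytes, nonce: bytes, data: bytes, index: int) -> bytes:
    """Process block `index` alone: encrypt (nonce || counter), then XOR with the data."""
    counter_block = nonce + index.to_bytes(BLOCK_SIZE - len(nonce), "big")
    keystream = toy_block_encrypt(key, counter_block)
    chunk = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    # zip stops at the shorter input, so a partial final block needs no padding.
    return bytes(a ^ b for a, b in zip(chunk, keystream))


def ctr_process(key: bytes, nonce: bytes, data: bytes, workers: int = 4) -> bytes:
    """Encrypt or decrypt `data` in CTR mode; the two operations are identical."""
    n_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # No block depends on another block's result, so they can run concurrently.
        parts = pool.map(lambda i: ctr_xor_block(key, nonce, data, i), range(n_blocks))
        return b"".join(parts)


if __name__ == "__main__":
    key, nonce = b"k" * 16, b"n" * 8
    message = bytes(range(100))
    ciphertext = ctr_process(key, nonce, message)
    assert ctr_process(key, nonce, ciphertext) == message
```

On a GPU, the per-block function would plausibly map to one thread per block (or several blocks per thread), which is why CTR mode suits this kind of hardware better than chained modes whose blocks depend on each other.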


Pages: 18

