Highly Efficient Implementation of Block Ciphers on Graphic Processing Units for Massively Large Data

Cited by: 14
Authors
An, SangWoo [1]
Seo, Seog Chung [1,2]
Affiliations
[1] Kookmin Univ, Dept Financial Informat Secur, Seoul 02707, South Korea
[2] Kookmin Univ, Dept Informat Secur Cryptol & Math, Seoul 02707, South Korea
Source
APPLIED SCIENCES-BASEL | 2020 / Vol. 10 / Issue 11
Funding
National Research Foundation, Singapore;
Keywords
AES; CHAM; LEA; Graphic Processing Unit (GPU); CUDA; Counter (CTR) mode; Parallel Processing;
DOI
10.3390/app10113711
Chinese Library Classification number
O6 [Chemistry];
Discipline classification code
0703;
Abstract
With the advent of IoT and cloud computing services, the volume of user data to be managed and file data to be transmitted has increased significantly. To protect users' personal information, this data must be encrypted securely and efficiently. Because servers that handle many clients or IoT devices must encrypt large amounts of data in real time without compromising service capability, Graphic Processing Units (GPUs) have been considered a suitable crypto accelerator for this workload. In this paper, we present highly efficient implementations of block ciphers on NVIDIA GPUs (specifically the Maxwell, Pascal, and Turing architectures) for IoT and cloud computing applications that process massively large data. As target ciphers, we choose AES, a representative standard block cipher; LEA, which was recently added to the ISO/IEC 29192-2:2019 standard; and CHAM, a recently developed lightweight block cipher. To maximize parallelism in the encryption process, we use the Counter (CTR) mode of operation and customize it to the GPU's characteristics. We apply several optimization techniques suited to the GPU architecture, such as kernel parallelism, memory optimization, and CUDA streams. Furthermore, we optimize each target cipher according to its algorithmic characteristics by implementing its core part in handcrafted inline PTX (Parallel Thread eXecution) code, the virtual assembly language of the CUDA platform. With these optimizations, on an RTX 2070 GPU, AES and LEA reach up to 310 Gbps and 2.47 Tbps of throughput, respectively, which are 10.7% and 67% higher than the previous best results of 279.86 Gbps and 1.47 Tbps. For CHAM, this is the first optimized implementation on GPUs, and it achieves 3.03 Tbps of throughput on an RTX 2070 GPU.
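The reason CTR mode suits GPU parallelism can be illustrated with a small sketch: each keystream block depends only on the key, the nonce, and the block index, so blocks can be processed independently and in any order. The code below is not the authors' CUDA implementation. It is a minimal Python model under stated assumptions: `toy_block_encrypt` is a hypothetical stand-in for a real cipher (it is not AES, LEA, or CHAM), the 8-byte nonce plus 8-byte counter layout is an illustrative choice, and a thread pool stands in for GPU threads only to show independence, not speed.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes; AES and LEA both use 128-bit blocks


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Hypothetical placeholder for a block cipher (NOT AES/LEA/CHAM).

    CTR mode only ever runs the cipher in the forward direction, so a
    keyed one-way function is enough to demonstrate the data flow.
    """
    return hashlib.sha256(key + block).digest()[:BLOCK]


def counter_block(nonce: bytes, index: int) -> bytes:
    # Assumed layout: 8-byte nonce followed by an 8-byte big-endian counter.
    return nonce + index.to_bytes(8, "big")


def ctr_chunk(key: bytes, nonce: bytes, data: bytes, first_index: int) -> bytes:
    """Process one contiguous chunk whose first block has index first_index."""
    out = bytearray(len(data))
    for off in range(0, len(data), BLOCK):
        index = first_index + off // BLOCK
        keystream = toy_block_encrypt(key, counter_block(nonce, index))
        piece = data[off:off + BLOCK]
        # XOR with the keystream; zip truncates for a short final block.
        out[off:off + len(piece)] = bytes(a ^ b for a, b in zip(piece, keystream))
    return bytes(out)


def ctr_parallel(key: bytes, nonce: bytes, data: bytes,
                 workers: int = 4, blocks_per_task: int = 64) -> bytes:
    """Split the data into block-aligned chunks and process them independently."""
    step = blocks_per_task * BLOCK
    tasks = [(off, data[off:off + step]) for off in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task derives its own counters from its offset; no shared state.
        parts = pool.map(lambda t: ctr_chunk(key, nonce, t[1], t[0] // BLOCK), tasks)
        return b"".join(parts)
```

Because encryption and decryption in CTR mode are the same XOR operation, calling `ctr_parallel` on the ciphertext with the same key and nonce recovers the plaintext. In a GPU setting, the per-chunk loop would roughly correspond to the work assigned to one thread, with the block index derived from the thread's global ID.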
Pages: 18