Accelerating RSA with Fine-Grained Parallelism Using GPU

Cited by: 15

Authors:

Yang, Yang ^{[1,2,3]}

Guan, Zhi ^{[1,2,3]}

Sun, Huiping ^{[2,3,4]}

Chen, Zhong ^{[1,2,3]}

Affiliations:

[1] Peking Univ, Inst Software, Sch EECS, Beijing, Peoples R China

[2] MoE Key Lab High Confidence Software Technol PKU, Beijing, Peoples R China

[3] MoE Key Lab Network & Software Secur Assurance PK, Beijing, Peoples R China

[4] Peking Univ, Sch Software & Microelect, Beijing, Peoples R China

Source:

INFORMATION SECURITY PRACTICE AND EXPERIENCE, ISPEC 2015 | 2015 / Vol. 9065

Keywords:

RSA; GPGPU; CUDA; Montgomery Multiplication; CRT;

DOI:

10.1007/978-3-319-17533-1_31

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

RSA is a public-key cryptosystem widely used for end-to-end authentication and key exchange in various Internet protocols, such as SSL and TLS. Compared with symmetric cryptography, the cryptographic operations in RSA are much more time-consuming. This puts performance pressure on service providers using secure protocols and hinders these protocols from being more widely used. Graphics Processing Units (GPUs) are increasingly used for data-parallel, compute-intensive general-purpose computing, and often provide better throughput than CPUs at the same cost. In this paper, we propose a new approach to parallelizing Montgomery multiplication under the Single Instruction Multiple Thread (SIMT) threading model of GPUs, and construct a parallel RSA implementation based on this approach, combined with other optimization techniques at both the algorithmic and implementation levels. The performance evaluation shows that our RSA implementation achieves record-breaking latency for RSA decryption on GPUs: 2.6 ms for RSA-1024 and 6.5 ms for RSA-2048. The peak throughput of our implementation reaches 5,244 decryptions per second for RSA-2048 and 34,981 for RSA-1024, which is much faster than existing integer-based implementations. The peak throughput of our implementation is slightly lower than that of the fastest floating-point-based implementation, while its latency is 3 times lower.
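For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal sequential sketch of Montgomery multiplication and CRT-based RSA decryption. It is not the paper's SIMT-parallel GPU method, which this record does not describe in detail. The function names (`mont_mul`, `mont_pow`, `rsa_crt_decrypt`) and the toy key are assumptions chosen for illustration only, and the code requires Python 3.8+ for modular inverses via `pow`.

```python
# Illustrative sketch only: single-threaded, big-integer Python,
# not the fine-grained GPU parallelization proposed in the paper.

def mont_mul(a, b, n, k, n_prime):
    """Montgomery product a * b * R^-1 mod n, with R = 2**k and odd n."""
    mask = (1 << k) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask   # m = (t mod R) * n' mod R
    u = (t + m * n) >> k                # exact division by R
    return u - n if u >= n else u       # single conditional subtraction

def mont_pow(base, exp, n):
    """Compute base**exp mod n (n odd) using Montgomery multiplication."""
    k = n.bit_length()                  # R = 2**k > n
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r      # n * n_prime == -1 (mod R)
    r2 = (r * r) % n
    x = mont_mul(base % n, r2, n, k, n_prime)  # base in Montgomery form
    acc = r % n                                # 1 in Montgomery form
    for bit in bin(exp)[2:]:            # left-to-right square-and-multiply
        acc = mont_mul(acc, acc, n, k, n_prime)
        if bit == "1":
            acc = mont_mul(acc, x, n, k, n_prime)
    return mont_mul(acc, 1, n, k, n_prime)     # leave Montgomery form

def rsa_crt_decrypt(c, p, q, d):
    """RSA decryption via CRT: two half-size exponentiations, then recombine."""
    dp, dq = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)
    m1 = mont_pow(c % p, dp, p)
    m2 = mont_pow(c % q, dq, q)
    h = (q_inv * (m1 - m2)) % p         # Garner's recombination
    return m2 + h * q

# Toy key (hypothetical, far too small for real use): p=61, q=53, e=17, d=2753
ciphertext = pow(65, 17, 61 * 53)
recovered = rsa_crt_decrypt(ciphertext, 61, 53, 2753)
```

With CRT, the two exponentiations work on moduli half the size of the RSA modulus and are independent of each other, which is one reason the technique is commonly paired with parallel hardware.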


Pages: 454-468

Number of pages: 15
