Jailbreak Attack for Large Language Models: A Survey

Cited by: 0
Authors
Li N. [1]
Ding Y. [1]
Jiang H. [1]
Niu J. [1]
Yi P. [1]
Affiliations
[1] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Funding
National Natural Science Foundation of China
Keywords
cyber security; generative artificial intelligence; jailbreak attack; large language model (LLM); natural language processing (NLP)
DOI
10.7544/issn1000-1239.202330962
Abstract
In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat to LLMs: they bypass the security mechanisms of LLMs, weaken the effect of safety alignment, and induce harmful outputs from aligned models. The abuse, hijacking, and leakage enabled by jailbreak attacks pose serious threats to both dialogue systems and applications built on LLMs. We present a systematic review of jailbreak attacks from recent years and categorize them into three distinct types according to their underlying mechanism: manually designed attacks, LLM-generated attacks, and optimization-based attacks. We summarize the core principles, implementation methods, and findings of the relevant studies, and we trace the evolutionary trajectory of jailbreak attacks on LLMs, offering a reference for future research. We also provide a concise overview of existing security measures, introducing pertinent techniques from the perspectives of internal and external defense that aim to mitigate jailbreak attacks and enhance the content security of LLM generation. Finally, we discuss the open challenges and frontier directions in the field of jailbreak attacks on LLMs and examine the potential of multimodal approaches, model editing, and multi-agent methodologies for countering such attacks, providing insights and research prospects to further advance LLM security. © 2024 Science Press. All rights reserved.
Pages: 1156-1181
Number of pages: 25