Jailbreak Attack for Large Language Models: A Survey

Authors
Li, Nan [1]
Ding, Yidong [1]
Jiang, Haoyu [1]
Niu, Jiafei [1]
Yi, Ping [1]
Affiliations
[1] School of Cyber Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Funding
National Natural Science Foundation of China
Keywords
Computational linguistics - Multi-agent systems - Natural language processing systems - Network security - Speech processing
DOI
10.7544/issn1000-1239.202330962
Abstract
In recent years, large language models (LLMs) have been widely applied to a range of downstream tasks and have demonstrated remarkable text understanding, generation, and reasoning capabilities across many fields. However, jailbreak attacks are emerging as a new threat to LLMs. Such attacks can bypass an LLM's security mechanisms, weaken the effect of safety alignment, and induce harmful outputs from aligned models. The resulting abuse, hijacking, and leakage pose serious threats to both dialogue systems and LLM-based applications. We present a systematic review of recent jailbreak attacks and categorize them into three types according to their underlying mechanism: manually designed attacks, LLM-generated attacks, and optimization-based attacks. We summarize the core principles, implementation methods, and research findings of the relevant studies and trace the evolution of jailbreak attacks on LLMs, offering a reference for future research. We also give a concise overview of existing security measures, introducing techniques from the perspectives of internal defense and external defense that aim to mitigate jailbreak attacks and improve the content security of LLM generation. Finally, we discuss the open challenges and frontier directions in jailbreak attacks on LLMs and examine the potential of multimodal approaches, model editing, and multi-agent methodologies for tackling them, providing insights and research prospects to further advance LLM security. © 2024 Science Press. All rights reserved.
Pages: 1156-1181