Defending ChatGPT against jailbreak attack via self-reminders

被引:22
|
作者
Xie, Yueqi [1 ]
Yi, Jingwei [2 ]
Shao, Jiawei [1 ]
Curl, Justin [3 ]
Lyu, Lingjuan [4 ]
Chen, Qifeng [1 ]
Xie, Xing [5 ]
Wu, Fangzhao [5 ]
机构
[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[2] Univ Sci & Technol China, Hefei, Peoples R China
[3] Tsinghua Univ, Beijing, Peoples R China
[4] Sony AI, Tokyo, Japan
[5] Microsoft Res Asia, Beijing, Peoples R China
关键词
D O I
10.1038/s42256-023-00765-8
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT's ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. We draw inspiration from the psychological concept of self-reminders and further propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user's query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions and proposes the psychologically inspired self-reminder technique that can efficiently and effectively mitigate against jailbreaks without further training. Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompt and demonstrate a simple but effective technique to counter these attacks by encapsulating users' prompts in another standard prompt that reminds ChatGPT to respond responsibly.
引用
收藏
页码:1486 / 1496
页数:16
相关论文
共 50 条
  • [41] Defending Against Eavesdropping Attack Leveraging Multiple Antennas in Wireless Networks
    Zou, Yulong
    Zhu, Jia
    Zheng, Baoyu
    2013 8TH INTERNATIONAL ICST CONFERENCE ON COMMUNICATIONS AND NETWORKING IN CHINA (CHINACOM), 2013, : 699 - 703
  • [42] Defending Method Against Jamming Attack in Wireless Ad Hoc Networks
    Ben-Othman, Jalel
    Hamieh, Ali
    2009 IEEE 34TH CONFERENCE ON LOCAL COMPUTER NETWORKS (LCN 2009), 2009, : 758 - 762
  • [43] MaliFuzz: Adversarial Malware Detection Model for Defending Against Fuzzing Attack
    Gao, Xianwei
    Shan, Chun
    Hu, Changzhen
    Journal of Beijing Institute of Technology (English Edition), 2024, 33 (05): : 436 - 449
  • [44] A Persistent Route Diversification Mechanism for Defending against Stealthy Crossfire Attack
    Zhou, Boyang
    Wu, Chunming
    Yang, Qiang
    Chen, Xiang
    Zhang, Dong
    SECURITY AND COMMUNICATION NETWORKS, 2022, 2022
  • [45] Defending Against SYN Flood Attack under Asymmetric Routing Environment
    Tao, Jianxi
    Zhou, Li
    Zhou, Zhou
    Yang, Rong
    Yang, Wei
    Liu, Qingyun
    PROCEEDINGS OF THE 1ST INTERNATIONAL WORKSHOP ON CLOUD COMPUTING AND INFORMATION SECURITY (CCIS 2013), 2013, 52 : 165 - 168
  • [46] Analyzing and Defending <monospace>GhostTouch</monospace> Attack Against Capacitive Touchscreens
    Wang, Kai
    Mitev, Richard
    Yan, Chen
    Ji, Xiaoyu
    Sadeghi, Ahmad-Reza
    Xu, Wenyuan
    IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2024, 21 (05) : 4360 - 4375
  • [47] Defending against packet dropping attack in vehicular ad hoc networks
    Djahel, Soufiene
    Nait-Abdesselam, Farid
    Zhang, Zonghua
    Khokhar, Ashfaq
    SECURITY AND COMMUNICATION NETWORKS, 2008, 1 (03) : 245 - 258
  • [48] Spectrum sensing defending against PUE attack based on fractal dimension
    Fu, Shuang
    Zhang, Guoyin
    Li Yang
    CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (02): : S2667 - S2675
  • [49] LFighter: Defending against the label-flipping attack in federated learning
    Jebreel, Najeeb Moharram
    Domingo-Ferrer, Josep
    Sanchez, David
    Blanco-Justicia, Alberto
    NEURAL NETWORKS, 2024, 170 : 111 - 126
  • [50] Detection of Cloned Recognizers: A Defending Method against Recognizer Cloning Attack
    Mori, Yuto
    Nakamura, Kazuaki
    Nitta, Naoko
    Babaguchi, Noboru
    2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 1375 - 1380