Defending ChatGPT against jailbreak attack via self-reminders

被引：22

作者：

Xie, Yueqi ^{[1
]}

Yi, Jingwei ^{[2
]}

Shao, Jiawei ^{[1
]}

Curl, Justin ^{[3
]}

Lyu, Lingjuan ^{[4
]}

Chen, Qifeng ^{[1
]}

Xie, Xing ^{[5
]}

Wu, Fangzhao ^{[5
]}

机构：

[1] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[2] Univ Sci & Technol China, Hefei, Peoples R China

[3] Tsinghua Univ, Beijing, Peoples R China

[4] Sony AI, Tokyo, Japan

[5] Microsoft Res Asia, Beijing, Peoples R China

来源：

NATURE MACHINE INTELLIGENCE | 2023年 / 5卷 / 12期

关键词：

D O I：

10.1038/s42256-023-00765-8

中图分类号：

TP18 [人工智能理论];

学科分类号：

081104 ; 0812 ; 0835 ; 1405 ;

摘要：

ChatGPT is a societally impactful artificial intelligence tool with millions of users and integration into products such as Bing. However, the emergence of jailbreak attacks notably threatens its responsible and secure use. Jailbreak attacks use adversarial prompts to bypass ChatGPT's ethics safeguards and engender harmful responses. This paper investigates the severe yet under-explored problems created by jailbreaks as well as potential defensive techniques. We introduce a jailbreak dataset with various types of jailbreak prompts and malicious instructions. We draw inspiration from the psychological concept of self-reminders and further propose a simple yet effective defence technique called system-mode self-reminder. This technique encapsulates the user's query in a system prompt that reminds ChatGPT to respond responsibly. Experimental results demonstrate that self-reminders significantly reduce the success rate of jailbreak attacks against ChatGPT from 67.21% to 19.34%. Our work systematically documents the threats posed by jailbreak attacks, introduces and analyses a dataset for evaluating defensive interventions and proposes the psychologically inspired self-reminder technique that can efficiently and effectively mitigate against jailbreaks without further training. Interest in using large language models such as ChatGPT has grown rapidly, but concerns about safe and responsible use have emerged, in part because adversarial prompts can bypass existing safeguards with so-called jailbreak attacks. Wu et al. build a dataset of various types of jailbreak attack prompt and demonstrate a simple but effective technique to counter these attacks by encapsulating users' prompts in another standard prompt that reminds ChatGPT to respond responsibly.

引用

页码：1486 / 1496

页数：16

共 50 条

[41] Defending Against Eavesdropping Attack Leveraging Multiple Antennas in Wireless Networks
Zou, Yulong
Zhu, Jia
Zheng, Baoyu
2013 8TH INTERNATIONAL ICST CONFERENCE ON COMMUNICATIONS AND NETWORKING IN CHINA (CHINACOM), 2013, : 699 - 703
[42] Defending Method Against Jamming Attack in Wireless Ad Hoc Networks
Ben-Othman, Jalel
Hamieh, Ali
2009 IEEE 34TH CONFERENCE ON LOCAL COMPUTER NETWORKS (LCN 2009), 2009, : 758 - 762
[43] MaliFuzz: Adversarial Malware Detection Model for Defending Against Fuzzing Attack
Gao, Xianwei
Shan, Chun
Hu, Changzhen
Journal of Beijing Institute of Technology (English Edition), 2024, 33 (05): : 436 - 449
[44] A Persistent Route Diversification Mechanism for Defending against Stealthy Crossfire Attack
Zhou, Boyang
Wu, Chunming
Yang, Qiang
Chen, Xiang
Zhang, Dong
SECURITY AND COMMUNICATION NETWORKS, 2022, 2022
[45] Defending Against SYN Flood Attack under Asymmetric Routing Environment
Tao, Jianxi
Zhou, Li
Zhou, Zhou
Yang, Rong
Yang, Wei
Liu, Qingyun
PROCEEDINGS OF THE 1ST INTERNATIONAL WORKSHOP ON CLOUD COMPUTING AND INFORMATION SECURITY (CCIS 2013), 2013, 52 : 165 - 168
[46] Analyzing and Defending <monospace>GhostTouch</monospace> Attack Against Capacitive Touchscreens
Wang, Kai
Mitev, Richard
Yan, Chen
Ji, Xiaoyu
Sadeghi, Ahmad-Reza
Xu, Wenyuan
IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, 2024, 21 (05) : 4360 - 4375
[47] Defending against packet dropping attack in vehicular ad hoc networks
Djahel, Soufiene
Nait-Abdesselam, Farid
Zhang, Zonghua
Khokhar, Ashfaq
SECURITY AND COMMUNICATION NETWORKS, 2008, 1 (03) : 245 - 258
[48] Spectrum sensing defending against PUE attack based on fractal dimension
Fu, Shuang
Zhang, Guoyin
Li Yang
CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2019, 22 (02): : S2667 - S2675
[49] LFighter: Defending against the label-flipping attack in federated learning
Jebreel, Najeeb Moharram
Domingo-Ferrer, Josep
Sanchez, David
Blanco-Justicia, Alberto
NEURAL NETWORKS, 2024, 170 : 111 - 126
[50] Detection of Cloned Recognizers: A Defending Method against Recognizer Cloning Attack
Mori, Yuto
Nakamura, Kazuaki
Nitta, Naoko
Babaguchi, Noboru
2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 1375 - 1380

← 1 2 3 4 5 →