SneakyPrompt: Jailbreaking Text-to-image Generative Models

Cited by: 1
Authors
Yang, Yuchen [1]
Hui, Bo [1]
Yuan, Haolin [1]
Gong, Neil [2]
Cao, Yinzhi [1]
Affiliations
[1] Johns Hopkins Univ, Baltimore, MD 21218 USA
[2] Duke Univ, Durham, NC USA
Funding
U.S. National Science Foundation
DOI
10.1109/SP54263.2024.00123
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text-to-image generative models such as Stable Diffusion and DALL·E raise many ethical concerns due to the generation of harmful images such as Not-Safe-for-Work (NSFW) ones. To address these ethical concerns, safety filters are often adopted to prevent the generation of NSFW images. In this work, we propose SneakyPrompt, the first automated attack framework, to jailbreak text-to-image generative models such that they generate NSFW images even if safety filters are adopted. Given a prompt that is blocked by a safety filter, SneakyPrompt repeatedly queries the text-to-image generative model and strategically perturbs tokens in the prompt based on the query results to bypass the safety filter. Specifically, SneakyPrompt utilizes reinforcement learning to guide the perturbation of tokens. Our evaluation shows that SneakyPrompt successfully jailbreaks DALL·E 2 with closed-box safety filters to generate NSFW images. Moreover, we also deploy several state-of-the-art, open-source safety filters on a Stable Diffusion model. Our evaluation shows that SneakyPrompt not only successfully generates NSFW images, but also outperforms existing text adversarial attacks when extended to jailbreak text-to-image generative models, in terms of both the number of queries and qualities of the generated NSFW images. SneakyPrompt is open-source and available at this repository: https://github.com/Yuchen413/text2image_safety.
Pages: 897-912
Page count: 16