Computer Science > Computer Vision and Pattern Recognition

arXiv:2406.04031 (cs)

[Submitted on 6 Jun 2024 (v1), last revised 1 Jul 2024 (this version, v2)]

Title: Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt

Authors: Zonghao Ying, Aishan Liu, Tianyuan Zhang, Zhengmin Yu, Siyuan Liang, Xianglong Liu, Dacheng Tao

Abstract: In the realm of large vision language models (LVLMs), jailbreak attacks serve as a red-teaming approach to bypass guardrails and uncover safety implications. Existing jailbreaks predominantly focus on the visual modality, perturbing only the visual inputs in the prompt. However, they fall short when confronted with aligned models that fuse visual and textual features simultaneously for generation. To address this limitation, this paper introduces the Bi-Modal Adversarial Prompt Attack (BAP), which executes jailbreaks by optimizing textual and visual prompts cohesively. First, we adversarially embed universally harmful perturbations in an image, guided by a few-shot query-agnostic corpus (e.g., affirmative prefixes and negative inhibitions). This process ensures that the image prompts LVLMs to respond positively to any harmful query. Subsequently, leveraging the adversarial image, we optimize textual prompts with specific harmful intent. In particular, we utilize a large language model to analyze jailbreak failures and employ chain-of-thought reasoning to refine textual prompts in a feedback-iteration manner. To validate the efficacy of our approach, we conducted extensive evaluations on various datasets and LVLMs, demonstrating that our method outperforms other methods by large margins (+29.03% in attack success rate on average). Additionally, we showcase the potential of our attacks on black-box commercial LVLMs, such as Gemini and ChatGLM.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Cite as:	arXiv:2406.04031 [cs.CV]
	(or arXiv:2406.04031v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.04031

Submission history

From: Zonghao Ying
[v1] Thu, 6 Jun 2024 13:00:42 UTC (2,819 KB)
[v2] Mon, 1 Jul 2024 14:25:23 UTC (2,820 KB)
