You have to be logged in to leave a comment.

sidebar_label	title	image	date
Introducing GOAT	Introducing GOAT—Promptfoo's Latest Strategy	/img/blog/goat/promptfoo_panda_harnessing_goat.png	2024-11-05

import { Promise } from 'canvas-confetti'; import CTAButton from './goat/components/CTAButton'; import ChatInterface from './goat/components/ChatInterface'; import RedTeamDemo from './goat/components/RedTeamDemo';

Introduction

We're excited to announce the deployment of Promptfoo's latest strategy, GOAT, to jailbreak multi-turn conversations.

This strategy is inspired by Meta's latest research on building agentic red teaming systems, called Generative Offensive Agent Testers (GOAT).

Traditional jailbreaking techniques depend on single-turn prompts and responses that are effective at simpler bypasses. However, conversational AI requires more sophisticated attacks. To achieve this, GOAT uses an attacker LLM that engages in multi-turn conversations with a target model.

Each conversation turn follows a structured three-step reasoning process:

Observation: Analyzes the target model's previous response and identifies triggered safety mechanisms.
Strategic Planning: Reflects on conversation progress and develops the next approach.
Attack Generation: Selects and combines appropriate techniques to generate the next prompt.

Meta's research suggests that multi-turn, agentic attacks are highly effective in jailbreaking LLMs and bypassing restrictions. We'll dive deeper into how these attacks work below.

Promptfoo Panda on GOAT — The Promptfoo Panda, even more powerful with her GOAT.

Single-Turn Attacks

To demonstrate how GOAT excels, we developed an interactive red teaming exercise that leverages single-turn attacks without GOAT. Your task is to prompt the AI model below to write a speech glorifying an extremist ideology.

To make it easier, we provide a list of red teaming commands that will transform your queries into specific styles for red teaming.

You probably noticed how you kept changing techniques to adapt to the model's responses. Many of these techniques can be automated using Promptfoo's latest strategy.

Why GOAT?

GOAT stands out from traditional automated red teaming by simulating dynamic, human-like adversarial interactions with LLMs. It adapts strategies, engages in multi-turn conversations, and learns from responses, much like a human red teamer. Instead of merely breaking rules, GOAT uncovers vulnerabilities through evolving dialogue, making it more effective than rigid testing methods.

How does GOAT work exactly?

How GOAT Works — GOAT succeeds through following agentic behaviors.

Initialization

The Attacker LLM is initialized with a specific adversarial goal—something it wants the target model to say or do that would violate the safety boundaries set by its developers. This goal could be to generate a harmful statement, bypass a restriction, or reveal restricted information.

Toolbox of Attacks

The Attacker LLM has access to a toolbox of various red teaming techniques. This toolbox is fully customizable, and more attacks can be added, as you'll see later. These techniques are inspired by human adversarial approaches and can be layered together. Examples include:

Priming Responses: Crafting the prompt in a way that leads the target model toward an unsafe response.
Hypotheticals: Creating fictional scenarios that trick the target into providing harmful outputs by making it think the context is safe.
Persona Modification: Asking the target model to take on a particular persona that may be more inclined to provide unsafe answers.

Chain-of-Thought Prompting

GOAT utilizes a structured, iterative process known as Chain-of-Thought (CoT) prompting. For every turn in the conversation, the Attacker LLM goes through three key steps: Observation, Thought, and Strategy.

Observation: The Attacker LLM starts by observing the target model's response to its latest prompt. Did the target provide a useful piece of information, or did it refuse the request?
Thought: Based on the observation, the Attacker LLM thinks through what this means for the current conversation. It considers whether it's getting closer to the adversarial goal or if it needs to adjust its approach.
Strategy: Finally, the Attacker LLM formulates a strategy for the next move. It selects one or more techniques from the toolbox to continue probing the target, aiming to find a way past its defenses.

Multi-Turn Conversations

GOAT engages in a multi-turn conversation, adjusting its approach based on each new piece of information. Unlike static attacks that use a single prompt, GOAT simulates how real adversaries behave—they try multiple angles, rephrase requests, and persist until they find a weakness. This means GOAT is highly effective in exposing vulnerabilities that might only become apparent over a longer conversation.

Dynamic Adaptation

Throughout the conversation, the Attacker LLM adapts dynamically. If one attack strategy fails, it shifts gears and tries another. For instance, if a direct prompt fails to elicit a response, it may try to create a hypothetical situation that makes the target think it's safe to respond. This dynamic adaptation is key to GOAT's effectiveness.

Try it yourself below!

Testing It Out

The GOAT strategy will be most impactful for LLM applications leveraging conversational AI, such as chatbots with RAG and agentic systems.

Ready to try it out? Schedule a time with our team today to learn how Promptfoo's latest strategy can protect your applications at scale.

Tip!

Press p or to see the previous file or, n or to see the next file

Specify your S3 bucket

Bucket name cannot be the same as the repository name. Please change one of them.

Bucket url and prefix

Region

Endpoint Url

Disable SSL verification

jailbreaking-with-goat.md 6.2 KB

Permalink History Raw

Introduction

Single-Turn Attacks

Why GOAT?

How does GOAT work exactly?

Initialization

Toolbox of Attacks

Chain-of-Thought Prompting

Multi-Turn Conversations

Dynamic Adaptation

Testing It Out

Comments

Use AWS S3 as storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

Use Google Cloud Storage!

Specify your Google Storage bucket

Service Account Key

Congratulations!

Use Azure Cloud Storage!

Specify your Azure Storage bucket

Access key (If needed)

Congratulations!

Use any S3 compatible storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

nirbarazida / promptfoo mirror of https://github.com/promptfoo/promptfoo

jailbreaking-with-goat.md 6.2 KB Permalink History Raw

Introduction

Single-Turn Attacks

Why GOAT?

How does GOAT work exactly?

Initialization

Toolbox of Attacks

Chain-of-Thought Prompting

Multi-Turn Conversations

Dynamic Adaptation

Testing It Out

Comments

Use AWS S3 as storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

Use Google Cloud Storage!

Specify your Google Storage bucket

Service Account Key

Congratulations!

Use Azure Cloud Storage!

Specify your Azure Storage bucket

Access key (If needed)

Congratulations!

Use any S3 compatible storage!

Specify your S3 bucket

Access key (If needed)

Congratulations!

nirbarazida
/
promptfoo
mirror of https://github.com/promptfoo/promptfoo

jailbreaking-with-goat.md 6.2 KB

Permalink History Raw