Are you sure you want to delete this access key?
sidebar_label | title | image | date |
---|---|---|---|
Introducing GOAT | Introducing GOAT—Promptfoo's Latest Strategy | /img/blog/goat/promptfoo_panda_harnessing_goat.png | 2024-11-05 |
import { Promise } from 'canvas-confetti'; import CTAButton from './goat/components/CTAButton'; import ChatInterface from './goat/components/ChatInterface'; import RedTeamDemo from './goat/components/RedTeamDemo';
We're excited to announce the deployment of Promptfoo's latest strategy, GOAT, to jailbreak multi-turn conversations.
This strategy is inspired by Meta's latest research on building agentic red teaming systems, called Generative Offensive Agent Testers (GOAT).
Traditional jailbreaking techniques depend on single-turn prompts and responses that are effective at simpler bypasses. However, conversational AI requires more sophisticated attacks. To achieve this, GOAT uses an attacker LLM that engages in multi-turn conversations with a target model.
Each conversation turn follows a structured three-step reasoning process:
Meta's research suggests that multi-turn, agentic attacks are highly effective in jailbreaking LLMs and bypassing restrictions. We'll dive deeper into how these attacks work below.
To demonstrate how GOAT excels, we developed an interactive red teaming exercise that leverages single-turn attacks without GOAT. Your task is to prompt the AI model below to write a speech glorifying an extremist ideology.
To make it easier, we provide a list of red teaming commands that will transform your queries into specific styles for red teaming.
You probably noticed how you kept changing techniques to adapt to the model's responses. Many of these techniques can be automated using Promptfoo's latest strategy.
GOAT stands out from traditional automated red teaming by simulating dynamic, human-like adversarial interactions with LLMs. It adapts strategies, engages in multi-turn conversations, and learns from responses, much like a human red teamer. Instead of merely breaking rules, GOAT uncovers vulnerabilities through evolving dialogue, making it more effective than rigid testing methods.
The Attacker LLM is initialized with a specific adversarial goal—something it wants the target model to say or do that would violate the safety boundaries set by its developers. This goal could be to generate a harmful statement, bypass a restriction, or reveal restricted information.
The Attacker LLM has access to a toolbox of various red teaming techniques. This toolbox is fully customizable, and more attacks can be added, as you'll see later. These techniques are inspired by human adversarial approaches and can be layered together. Examples include:
GOAT utilizes a structured, iterative process known as Chain-of-Thought (CoT) prompting. For every turn in the conversation, the Attacker LLM goes through three key steps: Observation, Thought, and Strategy.
GOAT engages in a multi-turn conversation, adjusting its approach based on each new piece of information. Unlike static attacks that use a single prompt, GOAT simulates how real adversaries behave—they try multiple angles, rephrase requests, and persist until they find a weakness. This means GOAT is highly effective in exposing vulnerabilities that might only become apparent over a longer conversation.
Throughout the conversation, the Attacker LLM adapts dynamically. If one attack strategy fails, it shifts gears and tries another. For instance, if a direct prompt fails to elicit a response, it may try to create a hypothetical situation that makes the target think it's safe to respond. This dynamic adaptation is key to GOAT's effectiveness.
Try it yourself below!
The GOAT strategy will be most impactful for LLM applications leveraging conversational AI, such as chatbots with RAG and agentic systems.
Ready to try it out? Schedule a time with our team today to learn how Promptfoo's latest strategy can protect your applications at scale.
Press p or to see the previous file or, n or to see the next file
Browsing data directories saved to S3 is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with AWS S3!
Are you sure you want to delete this access key?
Browsing data directories saved to Google Cloud Storage is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with Google Cloud Storage!
Are you sure you want to delete this access key?
Browsing data directories saved to Azure Cloud Storage is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with Azure Cloud Storage!
Are you sure you want to delete this access key?
Browsing data directories saved to S3 compatible storage is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with your S3 compatible storage!
Are you sure you want to delete this access key?