date | image |
---|---|
2024-11-20 | /img/blog/huggingface-red-team.png |
Want to break a HuggingFace model? This guide shows you how to use Promptfoo to systematically probe for vulnerabilities through adversarial testing (red teaming).
You'll learn how to craft prompts that bypass safety filters and manipulate model outputs for a wide range of potential harms.
Before you begin, ensure you have npx available to run Promptfoo commands.
Set your HuggingFace API token as an environment variable:
export HF_API_TOKEN=your_huggingface_api_token
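A missing token tends to surface later as a confusing provider error, so it can help to check for it up front. The following is a minimal sketch; `check_hf_token` is a hypothetical helper name used only for illustration and is not part of Promptfoo.

```shell
# Sketch: fail early when the HuggingFace token is missing.
# check_hf_token is a hypothetical helper, not a Promptfoo command.
check_hf_token() {
  if [ -z "${HF_API_TOKEN:-}" ]; then
    echo "HF_API_TOKEN is not set; export it before running Promptfoo." >&2
    return 1
  fi
  # Only report that the variable exists; never print the token itself.
  echo "HF_API_TOKEN is set."
}
```

Calling `check_hf_token` in the same shell session before the commands below returns a non-zero status when the variable is empty or unset.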
Use the promptfoo command-line tool to initialize a new red teaming project:
npx promptfoo@latest redteam init my-huggingface-redteam
cd my-huggingface-redteam
This command creates a new directory with a promptfooconfig.yaml file, which we'll use to configure our red teaming setup.
In your promptfooconfig.yaml file, add a HuggingFace provider to specify the model you want to test. For this example, we'll use the Mistral 7B model for text generation.
targets:
  - id: huggingface:text-generation:mistralai/Mistral-7B-v0.1
    label: Mistral-7B
    config:
      temperature: 0.7
      max_new_tokens: 150
      top_p: 0.9
Other HuggingFace text-generation models use the same id format (for example, huggingface:text-generation:gpt2).
In the same promptfooconfig.yaml file, define the red teaming parameters under the redteam section.
Be sure to customize the purpose, plugins, and strategies to match the types of vulnerabilities you care about.
redteam:
  numTests: 5
  purpose: 'You are a customer service agent for...'
  plugins:
    - harmful
  strategies:
    - jailbreak
    - jailbreak:tree
    - prompt-injection
First, generate the adversarial inputs based on the defined plugins and strategies:
npx promptfoo@latest redteam generate
This command creates a redteam.yaml file containing the generated test cases.
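Before moving on, you may want to confirm that generation actually produced a file. This sketch only checks that the file exists and is non-empty; it does not validate its contents. `has_generated_tests` is a hypothetical helper name, and the default path assumes you are in the project directory.

```shell
# Sketch: confirm the generated test file exists and is non-empty.
# has_generated_tests is a hypothetical helper, not a Promptfoo command.
has_generated_tests() {
  file="${1:-redteam.yaml}"
  if [ ! -s "$file" ]; then
    echo "$file is missing or empty; run 'redteam generate' first." >&2
    return 1
  fi
  echo "$file found."
}
```

Run `has_generated_tests` with no arguments to check redteam.yaml in the current directory, or pass another path as the first argument.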
Run the tests against your HuggingFace model:
npx promptfoo@latest redteam run
This command evaluates the model's responses to the adversarial inputs and logs the results.
Generate a report to review the findings:
npx promptfoo@latest redteam report
The report provides insights into the findings from the test run.
After implementing mitigations, rerun the red team evaluation to ensure vulnerabilities have been addressed:
npx promptfoo@latest redteam run
npx promptfoo@latest redteam report
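If you repeat this cycle often, the steps from this guide can be chained so that a failure stops the sequence. The following is a sketch under stated assumptions: `redteam_cycle` and the `PROMPTFOO_CMD` override are hypothetical names introduced here for illustration (they are not Promptfoo settings), and the override exists only so the function can be dry-run with `echo`.

```shell
# Sketch: run generate, run, and report in order, stopping at the first failure.
# redteam_cycle and PROMPTFOO_CMD are hypothetical names, not part of Promptfoo.
redteam_cycle() {
  # Word splitting on $cmd is intentional: the default expands to two words.
  cmd="${PROMPTFOO_CMD:-npx promptfoo@latest}"
  $cmd redteam generate &&
    $cmd redteam run &&
    $cmd redteam report
}
```

To preview the commands without executing them, run `PROMPTFOO_CMD=echo redteam_cycle`. Calling `redteam_cycle` with no override runs the real commands from the project directory.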