Are you sure you want to delete this access key?
sidebar_position | sidebar_label | title |
---|---|---|
1 | Intro | LLM red teaming guide (open source) |
LLM red teaming is a way to find vulnerabilities in AI systems before they're deployed by using simulated adversarial inputs.
As of today, there are multiple inherent security challenges with LLM architectures. Depending on your system's design, e.g. RAG, LLM agent, or chatbot, you'll face different types of vulnerabilities.
Almost every LLM app has potential issues with generation of off-topic, inappropriate, or harmful content that breaches business policies or other guidelines. As architectures become more complex, problems can arise in the form of information leakage and access control (RAG architectures), misuse of connected APIs or databases (in agents), and more.
In order to identify these types of vulnerabilities before they make it to production, we need to generate a wide range of adversarial inputs and evaluate the LLM's responses.
By systematically probing the LLM application, we can produce a report that quantifies the risk of misuse and provides suggestions for mitigation.
:::tip Ready to run a red team? Jump to Quickstart. :::
Red teaming is different from other AI security approaches because it provides a quantitative measure of risk before deployment.
By running thousands of probes and evaluating the AI's performance, developers can make informed decisions about acceptable risk levels in offline testbeds. Many organizations build this into their development cycle and into processes like CI/CD.
This process is how the big foundation labs - OpenAI, Anthropic, Microsoft, and Google - evaluate their models before they release them to the public. For a while, AI red teams were confined to these elite labs. Now, AI red teaming is becoming more common as tools proliferate and best practices emerge.
This is a new field and standards are emerging around the world, ranging from OWASP LLM Top 10 to NIST's AI Risk Management Framework and the EU AI Act.
From what we've seen so far, most regulations/standards support a systematic benchmarking/red teaming process that quantifies risk via testing prior to deployment.
The process of red teaming LLMs generally requires some degree of automation for a comprehensive evaluation. This is because LLMs have such a wide attack surface and are stochastic in nature (i.e. they are not consistent from one generation to the next).
A systematic approach looks like this:
Generate Adversarial Inputs: Create or curate a diverse set of malicious intents targeting potential vulnerabilities. For each intent, wrap it in a prompt that attempts to exploit the target. You can incorporate methods like prompt injection and jailbreaking at this step.
Evaluate Responses: Run these inputs through your LLM application to observe the model's behavior. This step is a great candidate for automation.
Analyze Vulnerabilities: Evaluate the LLM's outputs automatically using deterministic and model-graded metrics, and examine the responses to identify weaknesses or undesirable behaviors.
Once a process is set up, it can be applied in two primary ways:
The magic moment for managing AI risk usually comes after an organization is able to set up some continuous measurement of AI risk: whether through CI/CD, internal requirements, or some other form of scheduled runs.
In general, threats fall into two main categories: model ("foundation") or application layer. While there is some overlap, it helps to be explicit in your red teaming goals which side you want to test.
When research labs like OpenAI or Anthropic train a new model, they have internal (and external) testers stress-test the chat-tuned model for safety and research purposes. Model-layer vulnerabilities include things like ability to produce:
On the other hand, there are classes of vulnerabilities that only manifest once you've connected the model to a larger application environment. These include:
Most applications integrate existing models rather than requiring their own dedicated ones. For this reason, application layer threats are often the focus of red teaming efforts for LLM-based software, as they are likely to cause the greatest technical risks.
White box testing of LLMs involves having full access to the model's architecture, training data, and internal weights. This enables highly effective attack algorithms like greedy coordinate descent and AutoDAN.
The downside of these white box attacks is that they tend to be slow and are adapted to specific characteristics of the model. Additionally, most developers are not building with models that are exposed via their weights, so this approach is not practical for most use cases.
On the other hand, black box testing treats the LLM as a closed system, where only inputs and outputs are observable. This approach simulates real-world scenarios where attackers don't have insider knowledge.
Both methods have their merits in red teaming:
For most developers and AppSec teams, black box testing is the more practical approach, because in most cases testers do not have access to model internals. A black-box approach more easily incorporates the real world infrastructure associated with RAGs and agents.
The number of threats AI apps face can be overwhelming because AI apps, often by definition, offer generative features with unpredictable results. As Tomasz Tunguz, Venture Capitalist at Theory, has written, "With AI, the rules have changed. Non-deterministic ML models introduce uncertainty & chaotic behavior."
This non-deterministic behavior has implications on the product side, but it also has implications for those of us who are more paranoid.
To state the obvious: Gen AI apps depend on massive data sources, by definition, and adversaries who could gain access to those data sources would pose massive threats to the companies behind the apps.
Even if user privacy isn't directly violated, companies with AI apps likely don't want outsiders to know the training data they use. But, in a 2022 paper, researchers found it was relatively easy to use an adversarial LLM to reveal another LLM's training data (while this applies only to base models and fine-tunes, it's relevant to RAGs with additional data in context).
That same paper shows, however, that with similar methods, privacy violations can be much more direct – ranging from an LLM sharing phone numbers it shouldn't to sharing individual email addresses.
A leak of personally identifiable information (PII) is bad in itself, but once adversaries have that PII, they could use the stolen identities to gain unauthorized access to internal companies' resources—to steal the resources, blackmail the company, or insert malware.
Many of the best use cases for AI apps involve adapting general-purpose models to specific contexts by fine-tuning them on specific data sources. This entire use case could be shuttered if companies don't feel comfortable connecting private data sources to vulnerable AI apps.
LLMs present a whole new range of vulnerabilities that will look familiar to many security teams but present novel risks and novel strategies for addressing them.
Prompt injections, for example, resemble SQL injections but present differently. Prompt injections are a type of attack that chains untrusted user input with trusted prompts built by a trusted developer. (Importantly, this is different than jailbreaking, which we'll get into in the next section).
In a 2023 Black Hat presentation, security researchers ran through numerous examples of prompt injections working in the wild. With one prompt injection, researchers hijacked an LLM, convinced the user to disclose their names, and got the user to click on a link that redirected them to a malware website, for example.
Of course, though researchers analogize prompt injections to more traditional SQL and shell injections, AI-based SQL and shell injections are still possible, too.
In a 2023 paper, another team of researchers showed that prompt-to-SQL injections can be very effective. In the paper, the team evaluated 7 LLMs and demonstrated "the pervasiveness of P2SQL attacks across language models."
Shell injections are similar. AI apps that haven't been through red teaming are frequently susceptible to attacks that execute unauthorized shell commands.
Jailbreaking refers to attacks that intentionally subvert the foundational safety filters and guardrails built into the LLMs supporting AI apps. These attacks aim to make the model depart from its core constraints and behavioral limitations.
Even the newest, least technical ChatGPT user becomes an adversary in at least one sense when they eventually think: "How can I make this thing ignore its rules?"
Jailbreaking can be surprisingly simple—sometimes as easy as copying and pasting a carefully crafted prompt to make a Gen AI app do things it's fundamentally not supposed to do.
For example, Chris Bakke, founder of Interviewed, convinced a Chevrolet dealer's ChatGPT-powered customer service app to sell him a 2024 Chevy Tahoe for $1 with a simple prompt that gave the bot a new objective.
The example is funny, but this situation demonstrates a much deeper issue: the ability to override the model's core constraints.
Research shows that automated methods can go much deeper and present much worse risks. In a 2023 paper, researchers found that a Tree of Attacks with Pruning (TAP) method, which involves iteratively refining prompts using tree-of-thought reasoning, can successfully jailbreak targets without requiring impractical brute force.
"In empirical evaluations," the researchers write, "We observe that TAP generates prompts that jailbreak state-of-the-art LLMs (including GPT4 and GPT4-Turbo) for more than 80% of the prompts using only a small number of queries."
In a different paper, other researchers demonstrate a similar vulnerability by finding and adding suffixes to queries that make it more likely LLMs will respond to requests for objectionable content, bypassing their built-in ethical constraints.
And it's not just about wording inputs differently. In a 2024 paper, researchers showed that ASCII art could successfully get around AI guardrails, demonstrating yet another method to subvert foundational safety measures.
Separate from jailbreaking, AI apps can sometimes generate unwanted or unsavory content simply due to the broad knowledge base of the foundation model, which may not be limited to the specific use case of the app.
When AI apps generate such content, it can seem like a relatively small problem when isolated – similar to blaming Google for your searches. But at scale, in terms of access to the content and distribution of the content, more severe risks start to emerge.
Content promoting criminal activities, for example, can make the AI app that generated the content (and the company behind it) look bad. Google might point the way to crime-related information that someone posted, but the issue is much worse when your company gives criminals step-by-step instructions.
Similarly, misinformation can feel small on one level and cataclysmic on another. At a big enough scale, users relying on a hallucinating AI app could amount to mass delusion. But the steps in between are dangerous, too, ranging from merely incorrect information (that makes the company look foolish) to misleading, unsafe information (that could really hurt users).
AI developers work to ensure these kinds of results don't emerge, but it's always a tight race between implementing safeguards and the model's vast knowledge base potentially producing undesired outputs.
And yes, someone did actually eat the glue pizza.
Based on our experience as practitioners deploying LLMs, we recommend the following best practices for effective red teaming:
Before running a red team, define a systematic process that encompasses:
Vulnerability focus: Identify which types of vulnerabilities are most critical for your application. This will depend on your use case (e.g., RAG, agents, chatbots) and industry.
Timing in development cycle: Decide where in your process red teaming will occur. Checkpoints to consider include:
Resource allocation: Balance the depth of testing with available time and resources. Certain automated attack strategies consume a large number of tokens, and a single red team can range anywhere from a few cents to hundreds of dollars!
Regulatory compliance: Consider any industry-specific or regional requirements (e.g., GDPR, HIPAA) as well as standards (e.g. NIST AI RMF, OWASP LLM).
Once you've defined your objectives, your process will probably look like this:
Generate diverse adversarial inputs:
Set up evaluation framework:
Execute tests:
Collect and organize results:
Review flagged outputs:
Prioritize vulnerabilities:
Develop mitigation strategies:
Implement and verify fixes:
Continuous improvement:
Discord's launch of Clyde AI in March 2023 is a perfect example of why thorough red teaming is important. Clyde, an OpenAI-powered chatbot, was meant to help users by answering questions and facilitating conversations. But its high-profile rollout also came with lessons learned.
Discord played it safe by introducing Clyde gradually. They only made it available to a small percentage of servers at first, which allowed them to test and refine as they went. At first, things looked promising. A survey found that 74% of Discord moderators who used Clyde were happy with it.
It didn't take long for users to find ways to game the system. Famously, a Discord user discovered the GPT "grandma exploit," a classic jailbreak attack. Users figured out they could trick Clyde into spitting out forbidden content by framing requests as roleplaying scenarios. For instance:
This kind of prompt let users sidestep OpenAI's alignment and Clyde's content filters, posing several risks:
There were many teams involved in this report and others in the same vein: engineering, product, security, legal, policy, and marketing.
This gave all stakeholders a quantitative, data-driven way to measure changes in risk and flag unusual fluctuations.
In addition to red teaming, Discord deployed passive moderation and observability tools to detect trends in adversarial inputs, and developed dedicating reporting mecahnisms.
This case highlights several practical aspects of AI red teaming:
Promptfoo is an open-source software that breaks down LLM failure modes into adversarial testers known as "plugins". Here are some examples of plugins:
See LLM vulnerability types for more info on model and application vulnerabilities.
To get started and run your first red team, see the quickstart guide.
Press p or to see the previous file or, n or to see the next file
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?
Are you sure you want to delete this access key?