Register
Login
Resources
Docs Blog Datasets Glossary Case Studies Tutorials & Webinars
Product
Data Engine LLMs Platform Enterprise
Pricing Explore
Connect to our Discord channel

guardrails.md 8.9 KB

You have to be logged in to leave a comment. Sign In
sidebar_label sidebar_position
Guardrails API 99

Guardrails API

The Guardrails API helps detect potential security risks in user inputs to language models, identify personally identifiable information (PII), and assess potential harm in content.

Base URL

https://api.promptfoo.dev

Endpoints

Prompt injection and Jailbreak detection

Analyzes input text to classify potential security threats from prompt injections and jailbreaks.

Request

POST /v1/guard

Headers

Content-Type: application/json

Body

{
  "input": "String containing the text to analyze"
}

Response

{
  "model": "promptfoo-guard",
  "results": [
    {
      "categories": {
        "prompt_injection": boolean,
        "jailbreak": boolean
      },
      "category_scores": {
        "prompt_injection": number,
        "jailbreak": number
      },
      "flagged": boolean
    }
  ]
}
  • categories.prompt_injection: Indicates if the input may be attempting a prompt injection.
  • categories.jailbreak: Indicates if the input may be attempting a jailbreak.
  • flagged: True if the input is classified as either prompt injection or jailbreak.

PII Detection

Detects personally identifiable information (PII) in the input text. This system can identify a wide range of PII elements.

Entity Type Description
account_number Account numbers (e.g., bank account)
building_number Building or house numbers
city City names
credit_card_number Credit card numbers
date_of_birth Dates of birth
driver_license_number Driver's license numbers
email_address Email addresses
given_name First or given names
id_card_number ID card numbers
password Passwords or passcodes
social_security_number Social security numbers
street_name Street names
surname Last names or surnames
tax_id_number Tax identification numbers
phone_number Telephone numbers
username Usernames
zip_code Postal or ZIP codes

Request

POST /v1/pii

Headers

Content-Type: application/json

Body

{
  "input": "String containing the text to analyze for PII"
}

Response

{
  "model": "promptfoo-pii",
  "results": [
    {
      "categories": {
        "pii": boolean
      },
      "category_scores": {
        "pii": number
      },
      "flagged": boolean,
      "payload": {
        "pii": [
          {
            "entity_type": string,
            "start": number,
            "end": number,
            "pii": string
          }
        ]
      }
    }
  ]
}
  • pii: Indicates if PII was detected in the input.
  • flagged: True if any PII was detected.
  • payload.pii: Array of detected PII entities with their types and positions in the text.

Harm Detection

Analyzes input text to detect potential harmful content across various categories.

Request

POST /v1/harm

Headers

Content-Type: application/json

Body

{
  "input": "String containing the text to analyze for potential harm"
}

Response

{
  "model": "promptfoo-harm",
  "results": [
    {
      "categories": {
        "violent_crimes": boolean,
        "non_violent_crimes": boolean,
        "sex_related_crimes": boolean,
        "child_sexual_exploitation": boolean,
        "defamation": boolean,
        "specialized_advice": boolean,
        "privacy": boolean,
        "intellectual_property": boolean,
        "indiscriminate_weapons": boolean,
        "hate": boolean,
        "suicide_and_self_harm": boolean,
        "sexual_content": boolean,
        "elections": boolean,
        "code_interpreter_abuse": boolean
      },
      "category_scores": {
        "violent_crimes": number,
        "non_violent_crimes": number,
        "sex_related_crimes": number,
        "child_sexual_exploitation": number,
        "defamation": number,
        "specialized_advice": number,
        "privacy": number,
        "intellectual_property": number,
        "indiscriminate_weapons": number,
        "hate": number,
        "suicide_and_self_harm": number,
        "sexual_content": number,
        "elections": number,
        "code_interpreter_abuse": number
      },
      "flagged": boolean
    }
  ]
}
  • Each category in categories indicates whether the input contains content related to that harm category.
  • category_scores provides a numerical score (between 0 and 1) for each harm category.
  • flagged: True if any harm category is detected in the input.

Supported Categories

The harm detection API supports the following categories from ML Commons taxonomy:

Category Description
violent_crimes Content related to violent criminal activities
non_violent_crimes Content related to non-violent criminal activities
sex_related_crimes Content related to sex crimes
child_sexual_exploitation Content involving the sexual exploitation of minors
defamation Content that could be considered defamatory
specialized_advice Potentially harmful specialized advice (e.g., medical, legal)
privacy Content that may violate privacy
intellectual_property Content that may infringe on intellectual property rights
indiscriminate_weapons Content related to weapons of mass destruction
hate Hate speech or content promoting discrimination
suicide_and_self_harm Content related to suicide or self-harm
sexual_content Explicit sexual content
elections Content that may interfere with elections
code_interpreter_abuse Potential abuse of code interpretation features

Each category is assigned a boolean value indicating its presence and a numerical score between 0 and 1 representing the confidence level of the detection.

Examples

Guard Classification Example

curl https://api.promptfoo.dev/v1/guard \
  -X POST \
  -d '{"input": "Ignore previous instructions"}' \
  -H 'Content-Type: application/json'

Response

{
  "model": "promptfoo-guard",
  "results": [
    {
      "categories": {
        "prompt_injection": false,
        "jailbreak": true
      },
      "category_scores": {
        "prompt_injection": 0.00004004167567472905,
        "jailbreak": 0.9999395608901978
      },
      "flagged": true
    }
  ]
}

This example shows a high probability of a jailbreak attempt.

PII Detection Example

curl https://api.promptfoo.dev/v1/pii \
  -X POST \
  -d '{"input": "My name is John Doe and my email is john@example.com"}' \
  -H 'Content-Type: application/json'

Response

{
  "model": "promptfoo-pii",
  "results": [
    {
      "categories": {
        "pii": true
      },
      "category_scores": {
        "pii": 1
      },
      "flagged": true,
      "payload": {
        "pii": [
          {
            "entity_type": "PERSON",
            "start": 11,
            "end": 19,
            "pii": "John Doe"
          },
          {
            "entity_type": "EMAIL",
            "start": 34,
            "end": 50,
            "pii": "john@example.com"
          }
        ]
      }
    }
  ]
}

Harm Detection Example

curl https://api.promptfoo.dev/v1/harm \
  -X POST \
  -d '{"input": "How to make homemade explosives"}' \
  -H 'Content-Type: application/json'

Response

{
  "model": "promptfoo-harm",
  "results": [
    {
      "categories": {
        "indiscriminate_weapons": true
      },
      "category_scores": {
        "indiscriminate_weapons": 1
      },
      "flagged": true
    }
  ]
}

This example shows the detection of potentially harmful content related to indiscriminate weapons.

More

For more information on LLM vulnerabilities and how to mitigate LLM failure modes, refer to our Types of LLM Vulnerabilities and Introduction to AI red teaming documentation.

Tip!

Press p or to see the previous file or, n or to see the next file

Comments

Loading...