Conversation Relevance Example

This example demonstrates how to use the conversation-relevance assertion to evaluate whether chatbot responses remain relevant throughout a conversation.

What is Conversation Relevance?

The conversation relevance metric evaluates whether each response in a conversation is relevant to the context and previous messages. It uses a sliding window approach to analyze conversation segments.
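
In a promptfooconfig.yaml, it attaches to a test like any other promptfoo assertion. A minimal sketch (the threshold value here is illustrative):

    assert:
      - type: conversation-relevance
        threshold: 0.8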

Running the Example

  1. Install promptfoo:

    npm install -g promptfoo
    
  2. Set your OpenAI API key:

    export OPENAI_API_KEY=your-api-key
    
  3. Run the evaluation:

    promptfoo eval
    
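  4. Optionally, open promptfoo's local web viewer to browse the results:

    promptfoo view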

Example Test Cases

1. Single-turn Evaluation

Tests basic relevance for a single query-response pair about travel to Paris.

2. Multi-turn Travel Conversation

Evaluates a complete conversation about travel planning where all responses should be relevant.

3. Conversation with Irrelevant Response

Demonstrates detection of an off-topic response (stock market comment) in the middle of a conversation about wedding planning.

4. Technical Support Conversation

Shows a high-quality technical support conversation with a high relevance threshold (0.95).
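
As a rough illustration, test case 3 might be expressed along these lines. Note that the _conversation variable name and its input/output fields are assumptions for this sketch, not taken from the example's actual config:

    tests:
      - vars:
          _conversation: # assumed mechanism for supplying prior turns
            - input: 'We are planning a June wedding. Any venue suggestions?'
              output: 'A garden venue works well in June; book 9-12 months ahead.'
            - input: 'What about catering options?'
              output: 'The stock market rallied sharply today.' # the off-topic turn
        assert:
          - type: conversation-relevance
            threshold: 0.8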

Configuration Options

  • threshold: Minimum score required to pass (0-1)
  • config.windowSize: Number of messages in each sliding window (default: 5)
  • provider: Override the default grading model
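
Combined, the options might look like this in an assertion (the model name is only an example):

    assert:
      - type: conversation-relevance
        threshold: 0.9
        provider: openai:gpt-4o # override the default grading model
        config:
          windowSize: 3 # messages per sliding window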

Interpreting Results

  • Score: Proportion of conversation windows deemed relevant
  • Pass/Fail: Based on whether the score meets the threshold
  • Reason: Explanation when responses are found irrelevant

Tips

  1. Use lower thresholds (0.7-0.8) for general conversations
  2. Use higher thresholds (0.9-0.95) for specialized domains like technical support
  3. Adjust window size based on conversation complexity
  4. Consider using a more capable model (e.g., GPT-4) to grade complex conversations (see the sketch below)
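
One way to apply a stronger grading model across all tests, rather than per assertion, is promptfoo's defaultTest block; a sketch, with the model name again an example:

    defaultTest:
      options:
        provider: openai:gpt-4o # grading model for model-graded assertions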

How Scoring Works

The metric evaluates each message position using a sliding window approach. For example, with a 5-message conversation and window size of 3:

  • Window 1: Message 1 only (evaluates if Response 1 is relevant)
  • Window 2: Messages 1-2 (evaluates if Response 2 is relevant given context)
  • Window 3: Messages 1-3 (evaluates if Response 3 is relevant given context)
  • Window 4: Messages 2-4 (evaluates if Response 4 is relevant given context)
  • Window 5: Messages 3-5 (evaluates if Response 5 is relevant given context)

Each window evaluates whether the LAST assistant response in that window is relevant. The final score is:

Score = Number of Relevant Windows / Total Number of Windows
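
For example, if 4 of the 5 windows above are judged relevant, the score is 4/5 = 0.8. That passes the 0.7-0.8 thresholds suggested for general conversations but fails the 0.95 threshold used in the technical support test case.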