# f-score (F-Score HuggingFace Dataset Sentiment Analysis Eval)

You can run this example with:

```sh
npx promptfoo@latest init --example f-score
```

This project evaluates GPT-4o-mini's zero-shot performance on IMDB movie review sentiment analysis using promptfoo. Each model response includes:

- Sentiment classification
- Confidence score (1-10)
- Reasoning for the classification
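For reference, a parsed response following this structure might look like the sketch below. The exact field names here are an assumption for illustration; the actual schema is defined by this example's prompt:

```python
import json

# Hypothetical model output in the expected structure (field names are
# illustrative, not taken verbatim from this example's prompt):
raw = '{"sentiment": "positive", "confidence": 9, "reasoning": "Praises the acting and pacing."}'

response = json.loads(raw)
assert response["sentiment"] in ("positive", "negative")
assert 1 <= response["confidence"] <= 10
```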

## Quick Start

Set your OpenAI API key and run the evaluation:

```sh
export OPENAI_API_KEY=your-api-key
promptfoo eval
```

## Dataset

The evaluation uses the IMDB dataset from HuggingFace's `datasets` library, sampled to 100 reviews. The dataset is preprocessed into a CSV with two columns:

- `text`: The movie review content
- `sentiment`: The label ("positive" or "negative")

To modify the sample size or generate a new dataset, you can use `prepare_data.py`. First, install the Python dependencies:

```sh
pip install -r requirements.txt
```

Then run the preparation script:

```sh
python prepare_data.py
```
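For a sense of what the preparation step involves, the sketch below shows one way to build such a CSV with the `datasets` library. The real `prepare_data.py` may differ in its sampling details, and the `dataset.csv` output path here is an assumption:

```python
# Sketch only: sample 100 IMDB reviews into a two-column CSV.
# The output filename and shuffle seed are assumptions, not taken
# from the actual prepare_data.py.
import csv
from datasets import load_dataset

sample = load_dataset("imdb", split="train").shuffle(seed=42).select(range(100))

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
    writer.writeheader()
    for row in sample:
        # IMDB stores labels as 0 (negative) / 1 (positive); map them to
        # the string labels the eval expects.
        writer.writerow({
            "text": row["text"],
            "sentiment": "positive" if row["label"] == 1 else "negative",
        })
```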

## Metrics Overview

The evaluation implements F-score and related metrics using promptfoo's assertion system:

1. **Base metrics**, calculated for each test case using JavaScript assertions:

   ```yaml
   - type: javascript
     value: "output.sentiment === 'positive' && context.vars.sentiment === 'positive' ? 1 : 0"
     metric: true_positives
   ```

2. **Derived metrics**, calculated from the base metrics after the evaluation completes:

   ```yaml
   - name: precision
     value: true_positives / (true_positives + false_positives)

   - name: f1_score
     value: 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
   ```

The evaluation tracks:

- True/false positives/negatives: base metrics for classification
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 score: 2 × (precision × recall) / (precision + recall)
- Accuracy: (TP + TN) / total
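As a quick sanity check of those formulas, here is a small worked example with made-up confusion-matrix counts (the numbers are placeholders, not results from this eval):

```python
# Placeholder confusion-matrix counts; not actual eval results.
tp, fp, fn, tn = 45, 5, 8, 42

precision = tp / (tp + fp)                  # 45/50 = 0.90
recall = tp / (tp + fn)                     # 45/53 ≈ 0.849
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 87/100 = 0.87

# The single-expression F1 used in the derived metric above is
# algebraically the same quantity:
assert abs(f1 - 2 * tp / (2 * tp + fp + fn)) < 1e-12
```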