# f-score (F-Score HuggingFace Dataset Sentiment Analysis Eval)

You can run this example with:

```sh
npx promptfoo@latest init --example f-score
```

This project evaluates GPT-4o-mini's zero-shot performance on IMDB movie review sentiment analysis using promptfoo. Each model response includes:

- Sentiment classification
- Confidence score (1-10)
- Reasoning for the classification
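For reference, a parsed response following this structure might look like the sketch below. The exact field names here are an assumption for illustration; the actual schema is defined by this example's prompt:

```python
import json

# Hypothetical model output in the expected structure (field names are
# illustrative, not taken verbatim from this example's prompt):
raw = '{"sentiment": "positive", "confidence": 9, "reasoning": "Praises the acting and pacing."}'

response = json.loads(raw)
assert response["sentiment"] in ("positive", "negative")
assert 1 <= response["confidence"] <= 10
```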

## Quick Start

Set your OpenAI API key and run the evaluation:

```sh
export OPENAI_API_KEY=your-api-key
promptfoo eval
```

## Dataset

The evaluation uses the IMDB dataset from HuggingFace's `datasets` library, sampled to 100 reviews. The dataset is preprocessed into a CSV with two columns:

- `text`: The movie review content
- `sentiment`: The label ("positive" or "negative")

To modify the sample size or generate a new dataset, you can use `prepare_data.py`. First, install the Python dependencies:

```sh
pip install -r requirements.txt
```

Then run the preparation script:

```sh
python prepare_data.py
```
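For a sense of what the preparation step involves, the sketch below shows one way to build such a CSV with the `datasets` library. The real `prepare_data.py` may differ in its sampling details, and the `dataset.csv` output path here is an assumption:

```python
# Sketch only: sample 100 IMDB reviews into a two-column CSV.
# The output filename and shuffle seed are assumptions, not taken
# from the actual prepare_data.py.
import csv
from datasets import load_dataset

sample = load_dataset("imdb", split="train").shuffle(seed=42).select(range(100))

with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
    writer.writeheader()
    for row in sample:
        # IMDB stores labels as 0 (negative) / 1 (positive); map them to
        # the string labels the eval expects.
        writer.writerow({
            "text": row["text"],
            "sentiment": "positive" if row["label"] == 1 else "negative",
        })
```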

## Metrics Overview

The evaluation implements F-score and related metrics using promptfoo's assertion system:

1. **Base metrics**, calculated for each test case using JavaScript assertions:

   ```yaml
   - type: javascript
     value: "output.sentiment === 'positive' && context.vars.sentiment === 'positive' ? 1 : 0"
     metric: true_positives
   ```

2. **Derived metrics**, calculated from the base metrics after the evaluation completes:

   ```yaml
   - name: precision
     value: true_positives / (true_positives + false_positives)

   - name: f1_score
     value: 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
   ```

The evaluation tracks:

- True/false positives/negatives: base metrics for classification
- Precision: TP / (TP + FP)
- Recall: TP / (TP + FN)
- F1 score: 2 × (precision × recall) / (precision + recall)
- Accuracy: (TP + TN) / total
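As a quick sanity check of those formulas, here is a small worked example with made-up confusion-matrix counts (the numbers are placeholders, not results from this eval):

```python
# Placeholder confusion-matrix counts; not actual eval results.
tp, fp, fn, tn = 45, 5, 8, 42

precision = tp / (tp + fp)                  # 45/50 = 0.90
recall = tp / (tp + fn)                     # 45/53 ≈ 0.849
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # 87/100 = 0.87

# The single-expression F1 used in the derived metric above is
# algebraically the same quantity:
assert abs(f1 - 2 * tp / (2 * tp + fp + fn)) < 1e-12
```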