Michael 4426dc8290
docs: add F-score example (#2198)
9 months ago

README.md

F-Score HuggingFace Dataset Sentiment Analysis Eval

This project evaluates GPT-4o-mini's zero-shot performance on IMDB movie review sentiment analysis using promptfoo. Each model response includes:

  • Sentiment classification
  • Confidence score (1-10)
  • Reasoning for the classification

Quick Start

Set your OpenAI API key in the OPENAI_API_KEY environment variable, then run the evaluation:

promptfoo eval

Dataset

The evaluation uses the IMDB dataset from HuggingFace's datasets library, sampled to 100 reviews. The dataset is preprocessed into a CSV with two columns:

  • text: The movie review content
  • sentiment: The label ("positive" or "negative")

To modify the sample size or generate a new dataset, you can use prepare_data.py. First, install the Python dependencies:

pip install -r requirements.txt

Then run the preparation script:

python prepare_data.py
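
As a rough illustration of the preprocessing described above, the sketch below samples records and writes the two-column CSV. It is not the contents of prepare_data.py: the function names, the output file name `imdb_sample.csv`, the choice of the `test` split, and the seed are all assumptions made for the example. The label mapping (0 = negative, 1 = positive) follows the IMDB dataset on HuggingFace.

```python
import csv
import random

# IMDB on HuggingFace stores labels as integers: 0 = negative, 1 = positive.
LABEL_NAMES = {0: "negative", 1: "positive"}


def sample_rows(records, sample_size=100, seed=42):
    """Randomly pick up to `sample_size` records and map labels to strings.

    `records` is any iterable of dicts with "text" and "label" keys
    (hypothetical helper, not taken from prepare_data.py).
    """
    records = list(records)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    chosen = rng.sample(records, min(sample_size, len(records)))
    return [
        {"text": r["text"], "sentiment": LABEL_NAMES[r["label"]]}
        for r in chosen
    ]


def write_csv(rows, path):
    """Write rows to a CSV with the columns the evaluation expects."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "sentiment"])
        writer.writeheader()
        writer.writerows(rows)


def main():
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    split = load_dataset("imdb", split="test")  # assumed split
    write_csv(sample_rows(split, sample_size=100), "imdb_sample.csv")  # assumed file name
```

Calling `main()` downloads the dataset and writes the CSV; changing `sample_size` is the equivalent of modifying the sample size in this sketch.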

Metrics Overview

The evaluation implements F-score and related metrics using promptfoo's assertion system:

  1. Base Metrics, calculated for each test case using JavaScript assertions:

     - type: javascript
       value: "output.sentiment === 'positive' && context.vars.sentiment === 'positive' ? 1 : 0"
       metric: true_positives

  2. Derived Metrics, calculated from the base metrics after the evaluation completes:

     - name: precision
       value: true_positives / (true_positives + false_positives)

     - name: f1_score
       value: 2 * true_positives / (2 * true_positives + false_positives + false_negatives)
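
To make the per-test-case counting concrete, here is a minimal Python sketch of the same idea: each (predicted, expected) pair contributes 1 to exactly one base metric, and the totals are what the derived metrics consume. Only the `true_positives` assertion appears in the configuration excerpt above; the other three categories are assumed to follow the same pattern, and `base_metric` and `tally` are hypothetical helper names. The sketch also assumes that any prediction other than "positive" counts as a negative prediction.

```python
from collections import Counter

METRIC_NAMES = (
    "true_positives",
    "false_positives",
    "true_negatives",
    "false_negatives",
)


def base_metric(predicted, expected, positive="positive"):
    """Return the name of the base metric that this test case increments."""
    if predicted == positive:
        return "true_positives" if expected == positive else "false_positives"
    # Assumption: anything that is not "positive" is treated as a negative prediction.
    return "false_negatives" if expected == positive else "true_negatives"


def tally(pairs):
    """Sum base metrics over (predicted, expected) sentiment pairs."""
    counts = Counter({name: 0 for name in METRIC_NAMES})
    for predicted, expected in pairs:
        counts[base_metric(predicted, expected)] += 1
    return dict(counts)


# Illustrative pairs only, not evaluation results.
example_pairs = [
    ("positive", "positive"),
    ("positive", "negative"),
    ("negative", "negative"),
    ("negative", "positive"),
]
print(tally(example_pairs))
```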

The evaluation tracks:

  • True/False Positives/Negatives: Base metrics for classification
  • Precision: TP / (TP + FP)
  • Recall: TP / (TP + FN)
  • F1 Score: 2 × (precision × recall) / (precision + recall), which is algebraically the same as 2 × TP / (2 × TP + FP + FN)
  • Accuracy: (TP + TN) / Total
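
The formulas in this list can be checked with a short Python sketch that computes them from the four base counts. The function name `derived_metrics` is hypothetical, the counts in the usage line are made-up illustrative numbers rather than results from this evaluation, and returning 0.0 for a zero denominator is a choice made for the sketch, not a statement about how promptfoo handles that case.

```python
def derived_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1, and accuracy from confusion counts."""

    def ratio(numerator, denominator):
        # Sketch-level choice: report 0.0 instead of dividing by zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        # Count-based form, matching the f1_score expression in the config.
        "f1_score": ratio(2 * tp, 2 * tp + fp + fn),
        # Same value expressed through precision and recall.
        "f1_from_pr": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up counts for illustration only.
metrics = derived_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Both F1 entries agree for any counts with a nonzero denominator, which is why the configuration can compute F1 directly from the base metrics without first deriving precision and recall.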