
# bert-score (BERTScore Evaluation)

Use BERTScore to measure semantic similarity between LLM outputs and reference text.

```sh
npx promptfoo@latest init --example bert-score
```

## Setup

```sh
pip install -r requirements.txt
```

Note: The first run will download the BERT model (~1.4GB).
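
If you'd rather pay the download cost up front, a one-off warm-up call caches the model. This is an optional sketch, not part of the example:

```python
# warm_up.py -- hypothetical helper, not shipped with this example.
# bert_score downloads and caches the default English model (roberta-large)
# the first time score() is called, so one throwaway call pre-warms it.
from bert_score import score

score(["warm up"], ["warm up"], lang="en", verbose=True)
```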

## Usage

### Basic Example

```yaml
# promptfooconfig.yaml
tests:
  - vars:
      text: 'Hello world'
      reference: 'Hi there'
    assert:
      - type: python
        value: file://bertscore_check.py
        threshold: 0.7 # Pass if similarity > 70%
```

Run: `promptfoo eval`
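
The config delegates scoring to `bertscore_check.py`, which isn't reproduced here. A minimal sketch of what such a script could look like, assuming promptfoo's file-based Python assertion convention (a `get_assert(output, context)` entry point); the script shipped with the example may differ:

```python
# bertscore_check.py -- illustrative sketch; the actual file in this
# example may differ.
from bert_score import score

def get_assert(output, context):
    # promptfoo calls get_assert() with the LLM output; test case
    # variables are available under context['vars'].
    reference = context['vars']['reference']
    # score() returns (precision, recall, F1) tensors, one entry per candidate.
    _, _, f1 = score([output], [reference], lang='en', verbose=False)
    # A returned float is treated as a score and checked against the
    # threshold set in promptfooconfig.yaml.
    return f1.item()
```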

### Advanced Example

Compare against multiple valid references:

```yaml
# promptfooconfig-advanced.yaml
assert:
  - type: python
    value: |
      from bert_score import score
      references = [
          "First valid answer",
          "Second valid answer",
          "Third valid answer"
      ]
      scores = []
      for ref in references:
          _, _, F1 = score([output], [ref], lang='en', verbose=False)
          scores.append(F1.item())
      return max(scores)  # Use best match
```

Run: `promptfoo eval -c promptfooconfig-advanced.yaml`
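
Looping over references is easy to read, but every `score()` call re-runs the model. As a possible optimization, `bert_score` also accepts a nested list of references per candidate and keeps the best match itself; this hypothetical variant assumes the library's multi-reference support behaves as documented:

```yaml
# Hypothetical single-call variant of promptfooconfig-advanced.yaml
# (assumes bert_score's multi-reference support keeps the best match).
assert:
  - type: python
    value: |
      from bert_score import score
      references = [
          "First valid answer",
          "Second valid answer",
          "Third valid answer"
      ]
      # One candidate, many references: note the nested list.
      _, _, F1 = score([output], [references], lang='en', verbose=False)
      return F1.item()
```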

## How It Works

BERTScore returns a similarity score from 0 to 1:

- 0.9+ = Nearly identical meaning
- 0.7–0.9 = Similar meaning
- <0.7 = Different meaning
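
You can sanity-check these bands with a standalone run (a minimal sketch; exact numbers vary with the underlying model, roberta-large by default for `lang='en'`):

```python
from bert_score import score

# Candidate/reference pairs intended to span the bands above.
pairs = [
    ("The cat sat on the mat.", "The cat sat on the mat."),      # identical
    ("The cat sat on the mat.", "A cat was sitting on a mat."),  # similar
    ("The cat sat on the mat.", "Quarterly revenue rose 12%."),  # unrelated
]
for candidate, reference in pairs:
    _, _, f1 = score([candidate], [reference], lang='en', verbose=False)
    print(f"F1={f1.item():.3f}  {candidate!r} vs {reference!r}")
```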

## Learn more
