# deepseek-r1-vs-openai-o1 (DeepSeek-R1 vs OpenAI o1 Comparison)

You can run this example with:

```sh
npx promptfoo@latest init --example deepseek-r1-vs-openai-o1
```

This example demonstrates how to benchmark DeepSeek's R1 model against OpenAI's o1 model using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy subjects.

## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- OpenAI API key set as `OPENAI_API_KEY`
- DeepSeek API key set as `DEEPSEEK_API_KEY`
- Hugging Face account and access token (for the MMLU dataset)
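Both provider keys must be visible to the shell that runs the eval. A minimal sketch, with placeholder values you would replace with your real keys:

```shell
# Placeholder values -- substitute your actual keys
export OPENAI_API_KEY=your_openai_key_here
export DEEPSEEK_API_KEY=your_deepseek_key_here

# Quick sanity check that both are set
[ -n "$OPENAI_API_KEY" ] && [ -n "$DEEPSEEK_API_KEY" ] && echo "provider keys set"
```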

## Hugging Face Authentication

To access the MMLU dataset, you'll need to authenticate with Hugging Face:

1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one.

2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

3. Set your token as an environment variable:

   ```sh
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```
   HF_TOKEN=your_token_here
   ```
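To confirm the token is actually visible to the process before kicking off a run, a small Python sketch (not part of the example itself) that only inspects the environment:

```python
import os

def check_hf_token(env):
    """Return 'set' if an HF_TOKEN entry is present and non-empty, else 'missing'."""
    return "set" if env.get("HF_TOKEN") else "missing"

# Check the real environment, plus a hypothetical populated one
print(check_hf_token(os.environ))
print(check_hf_token({"HF_TOKEN": "hf_example"}))  # prints "set"
```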

## Running the Eval

1. Get a local copy of the example's `promptfooconfig.yaml`.

   Clone this repository and, from its root directory, run:

   ```sh
   cd examples/deepseek-r1-vs-openai-o1
   ```

   Or fetch the example directly:

   ```sh
   promptfoo init --example deepseek-r1-vs-openai-o1
   ```

2. Run the evaluation:

   ```sh
   promptfoo eval
   ```

3. View the results in a web interface:

   ```sh
   promptfoo view
   ```

## What's Being Tested

This comparison evaluates both models on reasoning tasks from the MMLU benchmark, specifically:

1. **Abstract Algebra**: advanced mathematical reasoning
2. **Formal Logic**: logical statement analysis
3. **High School Mathematics**: core problem-solving
4. **College Mathematics**: advanced mathematical concepts
5. **Logical Fallacies**: flaw identification in reasoning

Each subject uses 10 questions to keep the test manageable. You can adjust this in `promptfooconfig.yaml`.
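A quick back-of-the-envelope on eval size, assuming one model call per question per provider:

```python
subjects = [
    "abstract_algebra",
    "formal_logic",
    "high_school_mathematics",
    "college_mathematics",
    "logical_fallacies",
]
questions_per_subject = 10   # the default in this example
models = 2                   # DeepSeek-R1 and OpenAI o1

# Total model calls for one full run
total_calls = len(subjects) * questions_per_subject * models
print(total_calls)  # 100
```

Doubling `limit` per subject doubles both runtime and API cost, so it's worth keeping this number in mind before scaling up.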

## Test Structure

The configuration in `promptfooconfig.yaml` defines:

1. **Prompt Template**: encourages step-by-step reasoning for multiple-choice questions
2. **Quality Checks**:
   - 60-second timeout per question
   - required step-by-step reasoning
   - clear final answer format
3. **Evaluation Metrics**:
   - accuracy
   - reasoning quality
   - response time
   - format adherence
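Those quality checks map onto promptfoo assertions. A hedged sketch of what such a `defaultTest` block could look like (the example's actual `promptfooconfig.yaml` is authoritative; the rubric wording here is illustrative):

```yaml
defaultTest:
  assert:
    # 60-second timeout per question
    - type: latency
      threshold: 60000
    # Clear final answer format
    - type: regex
      value: 'Therefore, the answer is [ABCD]'
    # Required step-by-step reasoning, graded by a model
    - type: llm-rubric
      value: The response reasons through the problem step by step before answering.
```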

## Customizing

You can modify the test by editing `promptfooconfig.yaml`:

1. Add more MMLU subjects:

   ```yaml
   tests:
     - huggingface://datasets/cais/mmlu?split=test&subset=physics
   ```

2. Try different prompting strategies:

   ```yaml
   prompts:
     # Zero-shot with step-by-step reasoning (default)
     - |
       You are an expert test taker. Please solve the following multiple choice question step by step.

       Question: {{question}}

       Options:
       A) {{choices[0]}}
       B) {{choices[1]}}
       C) {{choices[2]}}
       D) {{choices[3]}}

       Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

     # Zero-shot with direct answer
     - |
       Question: {{question}}

       A) {{choices[0]}}
       B) {{choices[1]}}
       C) {{choices[2]}}
       D) {{choices[3]}}

       Answer with just the letter (A/B/C/D) of the correct option.
   ```

3. Change the number of questions:

   ```yaml
   tests:
     - huggingface://datasets/cais/mmlu?split=test&subset=physics&limit=20 # Test 20 questions per subject
   ```

4. Adjust quality requirements:

   ```yaml
   defaultTest:
     assert:
       - type: latency
         threshold: 30000 # Stricter 30-second timeout
   ```
