You can run this example with:

```sh
npx promptfoo@latest init --example deepseek-r1-vs-openai-o1
```
This example demonstrates how to benchmark DeepSeek's R1 model against OpenAI's o1 model using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy subjects.
Alternatively, install promptfoo globally:

```sh
npm install -g promptfoo
# or
brew install promptfoo
```
This example requires the following environment variables:

- `OPENAI_API_KEY`
- `DEEPSEEK_API_KEY`
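Both keys can be exported in your shell before running the eval. The values below are placeholders, not real keys:

```shell
# Placeholder values shown; substitute your real keys.
export OPENAI_API_KEY="sk-your-openai-key"
export DEEPSEEK_API_KEY="your-deepseek-key"

# Sanity-check that both are set before running `promptfoo eval`:
for v in OPENAI_API_KEY DEEPSEEK_API_KEY; do
  if [ -n "$(printenv "$v")" ]; then echo "$v is set"; else echo "$v is MISSING"; fi
done
```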
To access the MMLU dataset, you'll need to authenticate with Hugging Face:
1. Create a Hugging Face account at huggingface.co if you don't have one
2. Generate an access token at huggingface.co/settings/tokens
3. Set your token as an environment variable:

```sh
export HF_TOKEN=your_token_here
```

Or add it to your `.env` file:

```
HF_TOKEN=your_token_here
```
Get a local copy of the promptfooconfig.yaml. You can clone this repository and run the following from the root directory:

```sh
cd examples/deepseek-r1-vs-openai-o1
```
Or you can get the example with:

```sh
promptfoo init --example deepseek-r1-vs-openai-o1
```
Run the evaluation:

```sh
promptfoo eval
```
View the results in a web interface:

```sh
promptfoo view
```
This comparison evaluates both models on reasoning-heavy tasks from the MMLU benchmark. Each subject uses 10 questions to keep the test manageable; you can edit this in promptfooconfig.yaml.
The full configuration is defined in promptfooconfig.yaml.
You can modify the test by editing promptfooconfig.yaml:
Add more MMLU subjects:

```yaml
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=physics
```
Try different prompting strategies:

```yaml
prompts:
  # Zero-shot with step-by-step reasoning (default)
  - |
    You are an expert test taker. Please solve the following multiple choice question step by step.

    Question: {{question}}

    Options:
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

  # Zero-shot with direct answer
  - |
    Question: {{question}}
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Answer with just the letter (A/B/C/D) of the correct option.
```
Change the number of questions:

```yaml
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=physics&limit=20 # Test 20 questions per subject
```
Adjust quality requirements:

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000 # Stricter 30-second timeout
```
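Beyond latency, you could also check the answer format with promptfoo's `regex` assertion type (verify the type name against your promptfoo version's docs). This fragment assumes the step-by-step prompt's closing phrase:

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 30000
    - type: regex
      value: "the answer is [ABCD]"
```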