You can run this example with:
npx promptfoo@latest init --example openai-gpt-4.1-vs-gpt-4o-mmlu
This example demonstrates how to benchmark OpenAI's GPT-4.1 against GPT-4o using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy academic subjects.
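Before you start, install the promptfoo CLI (npm install -g promptfoo or brew install promptfoo) and set your OpenAI API key in the OPENAI_API_KEY environment variable.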
To access the MMLU dataset, you'll need to authenticate with Hugging Face:
Create a Hugging Face account at huggingface.co if you don't have one
Generate an access token at huggingface.co/settings/tokens
Set your token as an environment variable:
export HF_TOKEN=your_token_here
Or add it to your .env file:
HF_TOKEN=your_token_here
Get a local copy of the example configuration (promptfooconfig.yaml).
Either clone this repository and, from the root directory, run:
cd examples/openai-gpt-4.1-vs-gpt-4o-mmlu
Or fetch the example directly with:
promptfoo init --example openai-gpt-4.1-vs-gpt-4o-mmlu
Run the evaluation:
promptfoo eval
View the results in a web interface:
promptfoo view
This comparison evaluates both models on reasoning tasks from the MMLU benchmark, specifically:
Each subject uses 10 questions to keep the test manageable. You can change this limit in promptfooconfig.yaml.
Based on OpenAI's benchmark results, you should observe:
GPT-4.1 should demonstrate superior performance in:
The configuration in promptfooconfig.yaml:
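As a rough orientation, the pieces used throughout this README fit together along the following lines. This is a minimal sketch, not a copy of the example's actual file: the description text, the gpt-4o provider settings, and the choice of abstract_algebra as the only subject are assumptions made for illustration.

```yaml
# Sketch only: values marked "assumed" are illustrative, not taken from the example.
description: GPT-4.1 vs GPT-4o on MMLU  # assumed wording

prompts:
  - |
    You are an expert test taker. Please solve the following multiple choice question step by step.

    Question: {{question}}

    Options:
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}

    Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.0
      max_tokens: 1000
  - id: openai:gpt-4o  # assumed to mirror the gpt-4.1 settings
    config:
      temperature: 0.0
      max_tokens: 1000

defaultTest:
  assert:
    - type: latency
      threshold: 30000  # milliseconds
    - type: llm-rubric
      value: Response demonstrates clear mathematical reasoning

tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=abstract_algebra&limit=10
```

Giving both providers the same temperature and max_tokens keeps the comparison about the models rather than their settings.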
You can modify the test by editing promptfooconfig.yaml:
Add more MMLU subjects:
tests:
  # Subset names must match the cais/mmlu config names
  # STEM subjects
  - huggingface://datasets/cais/mmlu?split=test&subset=college_physics&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=college_chemistry&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=college_biology&limit=10
  # Humanities
  - huggingface://datasets/cais/mmlu?split=test&subset=high_school_world_history&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=philosophy&limit=10
  # Professional domains
  - huggingface://datasets/cais/mmlu?split=test&subset=jurisprudence&limit=10
  - huggingface://datasets/cais/mmlu?split=test&subset=clinical_knowledge&limit=10
Try different prompting strategies:
prompts:
  # Zero-shot with step-by-step reasoning (default)
  - |
    You are an expert test taker. Please solve the following multiple choice question step by step.
    Question: {{question}}
    Options:
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}
    Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."
  # Direct answer approach
  - |
    Question: {{question}}
    A) {{choices[0]}}
    B) {{choices[1]}}
    C) {{choices[2]}}
    D) {{choices[3]}}
    Answer with just the letter (A/B/C/D) of the correct option.
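Because the step-by-step prompt asks for a fixed closing sentence, the output format can be checked mechanically. The following is a sketch that assumes promptfoo's regex assertion type; the pattern itself is illustrative and would need to be loosened for the direct-answer prompt, which returns only a letter.

```yaml
defaultTest:
  assert:
    # Passes only if the output contains the requested closing sentence
    # followed by one of the four option letters.
    - type: regex
      value: 'Therefore, the answer is [ABCD]'
```

This only verifies that an answer was given in the expected form, not that the chosen letter is correct.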
Change the number of questions:
tests:
  - huggingface://datasets/cais/mmlu?split=test&subset=abstract_algebra&limit=20 # Test 20 questions per subject
Adjust model parameters:
providers:
  - id: openai:gpt-4.1
    config:
      temperature: 0.0 # Even more deterministic
      max_tokens: 1000 # Allow longer explanations
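To see how much a single parameter matters, the same model can be listed twice with different settings. This is a sketch under assumptions: the label values and the 0.7 temperature are made up for illustration, and it relies on promptfoo's provider label field to tell the two entries apart in the results.

```yaml
providers:
  # Same model, two temperature settings, compared side by side.
  - id: openai:gpt-4.1
    label: gpt-4.1-temp-0    # hypothetical label
    config:
      temperature: 0.0
      max_tokens: 1000
  - id: openai:gpt-4.1
    label: gpt-4.1-temp-0.7  # hypothetical label
    config:
      temperature: 0.7       # illustrative value
      max_tokens: 1000
```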
Modify quality requirements:
defaultTest:
  assert:
    - type: latency
      threshold: 30000 # Stricter 30-second timeout
    - type: llm-rubric
      value: Response demonstrates clear mathematical reasoning
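Cost can be bounded in the same way. The sketch below assumes promptfoo's cost assertion type; the threshold is an arbitrary illustrative value in US dollars per test case, not a figure taken from this example.

```yaml
defaultTest:
  assert:
    # Fails any test case whose reported cost exceeds the threshold.
    - type: cost
      threshold: 0.01  # assumed value, in USD per test case
```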
GPT-4.1 offers better performance at lower cost: