---
title: Recreating Mistral Magistral AIME2024 Benchmarks in promptfoo
description: Reproduce Mistral's Magistral 73.6% AIME2024 mathematical reasoning benchmark using promptfoo with a simple evaluation setup comparing Magistral Medium vs Small.
image: /img/docs/mistral-magistral-aime2024-evaluation-results.png
keywords:
  [
    mistral,
    magistral,
    aime2024,
    mathematical reasoning,
    benchmark,
    mistral comparison,
    promptfoo evaluation,
  ]
sidebar_label: Magistral AIME2024 Benchmark
---
Mistral's Magistral models scored 73.6% (Medium) and 70.7% (Small) on the AIME2024 mathematical reasoning benchmark. This guide shows you how to reproduce those results using promptfoo.
:::tip Quick Start

```bash
npx promptfoo@latest init --example mistral
npx promptfoo@latest eval -c mistral/promptfooconfig.aime2024.yaml
```

:::
Mistral's published results:

| Model            | Single attempt | Majority voting (maj@64) |
| ---------------- | -------------- | ------------------------ |
| Magistral Medium | 73.6%          | 90.0%                    |
| Magistral Small  | 70.7%          | 83.3%                    |

:::note
Our evaluation calls each model once per problem. Mistral's highest scores used majority voting across 64 attempts - running the same problem 64 times and taking the most common answer.
:::
Set your Mistral API key and a Hugging Face token (needed to download the dataset - you can create a token in your Hugging Face account settings):

```bash
export MISTRAL_API_KEY=your_key
export HF_TOKEN=your_token
```

Create `promptfooconfig.aime2024.yaml`:

```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: Reproduce Mistral Magistral AIME2024 benchmark

prompts:
  - |
    Solve this AIME mathematical problem step by step.

    Problem: {{question}}

    Think through this carefully and provide your final answer as a 3-digit integer (000-999).

    End with: "Therefore, the answer is [your answer]."

providers:
  - id: mistral:magistral-medium-latest
    label: Magistral Medium
    config:
      temperature: 0.7
      top_p: 0.95
      max_tokens: 40960
  - id: mistral:magistral-small-latest
    label: Magistral Small
    config:
      temperature: 0.7
      top_p: 0.95
      max_tokens: 40960

tests:
  - huggingface://datasets/sea-snell/aime-2024?split=test

defaultTest:
  assert:
    - type: llm-rubric
      value: |
        Evaluate this mathematical solution to an AIME competition problem.

        The correct answer is: {{answer}}

        Grade as PASS if and only if:
        1. The response shows clear step-by-step mathematical reasoning
        2. The final answer presented equals {{answer}} exactly
        3. The mathematical work supports the conclusion

        Grade as FAIL if the final answer is incorrect, regardless of the reasoning quality.
```
This configuration demonstrates several key promptfoo concepts:

**Prompts**: The prompt template includes a `{{question}}` variable that gets populated from the dataset. You can modify this prompt to test different reasoning approaches - for example, you might add "Show your work clearly" or "Use multiple solution methods."

**Providers**: We're comparing two Mistral models with identical settings. The `max_tokens: 40960` allows for extended reasoning traces - crucial for complex math problems.

**Tests**: The `huggingface://` integration automatically loads the AIME2024 dataset. Each test case provides variables like `{{question}}` and `{{answer}}` to your prompts and assertions.

**LLM Rubric**: The `llm-rubric` assertion uses an LLM to evaluate responses based on your criteria. This is more flexible than exact string matching - it can assess both mathematical correctness and reasoning quality. The rubric gets access to both the model's response and the correct `{{answer}}` from the dataset.
:::tip Customize Your Evaluation

Try modifying the prompt to test different approaches:

```yaml
prompts:
  - 'Solve step-by-step: {{question}}'
  - 'Use multiple methods to solve: {{question}}'
  - 'Explain your reasoning clearly: {{question}}'
```

:::
Run the evaluation and view the results:

```bash
npx promptfoo@latest eval
npx promptfoo@latest view
```
You should see results showing Magistral Medium outperforming Magistral Small on mathematical reasoning, with detailed step-by-step solutions using the full 40k token context.
With single evaluations and strict answer matching (vs Mistral's 64-vote majority), expect scores closer to the single-attempt numbers above than to the 90% majority-voting result.
The evaluation process:

1. **Dataset Loading**: promptfoo automatically downloads the AIME2024 dataset from Hugging Face, which contains 30 mathematical problems with verified answers.
2. **Prompt Injection**: Each problem's `question` gets inserted into your prompt template, and the model generates a solution.
3. **LLM-Based Grading**: Instead of simple string matching, the `llm-rubric` uses an LLM evaluator to assess whether the response demonstrates correct mathematical reasoning and arrives at the right answer.
4. **Results Aggregation**: promptfoo calculates pass rates, shows individual responses, and highlights where each model succeeded or failed.
The LLM rubric is particularly important here because mathematical solutions can be expressed in many ways - the evaluator can recognize correct math even if the formatting varies.
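If you want the per-problem grades for your own analysis, you can also write results to a file with promptfoo's output flag (the filename here is just an example):

```bash
npx promptfoo@latest eval -c promptfooconfig.aime2024.yaml -o results.json
```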
:::tip Alternative Evaluation Methods

For deterministic evaluation, you can replace the LLM rubric with exact matching:

```yaml
defaultTest:
  assert:
    - type: javascript
      value: |
        // Extract the final answer from the response
        const match = output.match(/answer is (\d{3})/i);
        const modelAnswer = match ? match[1] : '';
        return modelAnswer === context.vars.answer;
```

Or use regex matching:

```yaml
defaultTest:
  assert:
    - type: regex
      value: 'answer is {{answer}}'
```

:::
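One caveat with the exact-match approaches above: the prompt asks for a zero-padded 3-digit answer, but models sometimes write "33" instead of "033". A slightly more forgiving variant (assuming the dataset's `answer` variable parses as a number) compares numerically:

```yaml
defaultTest:
  assert:
    - type: javascript
      value: |
        // Accept 1-3 digit answers and compare numerically, so "33" matches "033"
        const match = output.match(/answer is\s*(\d{1,3})/i);
        return match ? Number(match[1]) === Number(context.vars.answer) : false;
```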
The American Invitational Mathematics Examination (AIME) is a prestigious mathematics competition where:

- Every answer is an integer from 000 to 999
- Problems demand multi-step reasoning across algebra, geometry, number theory, and combinatorics
- The 2024 exams (AIME I and II) supply the 30 problems in this dataset
:::note AIME 2024 Problem 4
Let x, y, and z be positive real numbers that satisfy the following system of equations:

- log₂(x/yz) = 1/2
- log₂(y/xz) = 1/3
- log₂(z/xy) = 1/4

Then the value of |log₂(x⁴y³z²)| is m/n where m and n are relatively prime positive integers. Find m+n.
Answer: 33
:::
This problem requires logarithmic manipulation, algebraic substitution, and multi-step verification to reach the final answer.
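For a sense of the reasoning involved, here is one compact route to the answer (a sketch of a solution, not the model's required output): substituting a = log₂x, b = log₂y, c = log₂z turns the system into a − b − c = 1/2, b − a − c = 1/3, and c − a − b = 1/4. Adding the equations in pairs gives a = −7/24, b = −3/8, and c = −5/12, so |log₂(x⁴y³z²)| = |4a + 3b + 2c| = 25/8, and m + n = 25 + 8 = 33.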
To test more problems or reproduce the full benchmark:
```yaml
tests:
  # Test all 30 AIME2024 problems
  - huggingface://datasets/sea-snell/aime-2024?split=test
```
To implement majority voting like Mistral's 90% result:
```yaml
providers:
  - id: mistral:magistral-medium-latest
    config:
      temperature: 0.7 # Add randomness for diverse attempts
      max_tokens: 40960
# Run multiple times and aggregate results manually
```
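As a starting point for generating those multiple attempts, promptfoo's `repeat` evaluate option reruns every test case - a sketch assuming your promptfoo version supports it (the vote tallying across attempts is still manual):

```yaml
evaluateOptions:
  repeat: 8 # Mistral used 64 attempts per problem, which gets expensive quickly
```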
The complete example is available in our repository:
```bash
npx promptfoo@latest init --example mistral
cd mistral
npx promptfoo@latest eval -c promptfooconfig.aime2024.yaml
```