Are you sure you want to delete this access key?
Evaluate LLMs against Humanity's Last Exam (HLE), a challenging benchmark created by 1,000+ experts across 500+ institutions. HLE features 3,000+ questions spanning 100+ subjects, designed to push AI capabilities to their limits.
📖 Read the complete HLE benchmark guide →
You can run this example with:
npx promptfoo@latest init --example huggingface-hle
OPENAI_API_KEY
ANTHROPIC_API_KEY
Set your Hugging Face token:
export HF_TOKEN=your_token_here
Or add it to your .env
file:
HF_TOKEN=your_token_here
Get your token at huggingface.co/settings/tokens.
Run the evaluation:
npx promptfoo@latest eval
View results:
npx promptfoo@latest view
This evaluation tests models on:
Each question is evaluated for accuracy using an LLM judge that compares the model's response against the verified correct answer.
HLE is designed to be extremely challenging. Recent model performance:
Low scores are expected - this benchmark represents the cutting edge of AI evaluation.
Increase the sample size:
tests:
- huggingface://datasets/cais/hle?split=test&limit=100
Compare multiple providers:
providers:
- anthropic:claude-sonnet-4-20250514
- openai:o4-mini
- deepseek:deepseek-reasoner
Try alternative prompting strategies by modifying prompt.py
or using static prompts:
prompts:
- 'Answer this question step by step: {{question}}'
- file://prompt.py:create_hle_prompt
Press p or to see the previous file or, n or to see the next file
Browsing data directories saved to S3 is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with AWS S3!
Are you sure you want to delete this access key?
Browsing data directories saved to Google Cloud Storage is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with Google Cloud Storage!
Are you sure you want to delete this access key?
Browsing data directories saved to Azure Cloud Storage is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with Azure Cloud Storage!
Are you sure you want to delete this access key?
Browsing data directories saved to S3 compatible storage is possible with DAGsHub. Let's configure your repository to easily display your data in the context of any commit!
promptfoo is now integrated with your S3 compatible storage!
Are you sure you want to delete this access key?