huggingface-hle

Evaluate LLMs against Humanity's Last Exam (HLE), a challenging benchmark created by 1,000+ experts across 500+ institutions. HLE features 3,000+ questions spanning 100+ subjects, designed to push AI capabilities to their limits.

📖 Read the complete HLE benchmark guide →

You can run this example with:

npx promptfoo@latest init --example huggingface-hle

Prerequisites

  • OpenAI API key set as OPENAI_API_KEY
  • Anthropic API key set as ANTHROPIC_API_KEY
  • Hugging Face access token set as HF_TOKEN (required for dataset access)

Setup

Set your Hugging Face token:

export HF_TOKEN=your_token_here

Or add it to your .env file:

HF_TOKEN=your_token_here

Get your token at huggingface.co/settings/tokens.
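
The OpenAI and Anthropic keys from the prerequisites are read from the environment in the same way, so export them alongside the token (placeholder values shown):

export OPENAI_API_KEY=your_openai_key_here
export ANTHROPIC_API_KEY=your_anthropic_key_here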

Run the Evaluation

Run the evaluation:

npx promptfoo@latest eval

View results:

npx promptfoo@latest view
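
To keep a copy of the raw results, eval can also write them to a file, with the format inferred from the extension:

npx promptfoo@latest eval --output results.json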

What's Tested

This evaluation tests models on:

  • Advanced mathematics and sciences
  • Humanities and social sciences
  • Professional domain knowledge
  • Multimodal reasoning
  • Interdisciplinary topics

Each question is evaluated for accuracy using an LLM judge that compares the model's response against the verified correct answer.
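
As a rough sketch, this kind of model-graded check maps onto promptfoo's llm-rubric assertion. The {{answer}} variable below assumes the dataset row exposes the verified answer under that name; the example's actual grading config may differ:

defaultTest:
  assert:
    - type: llm-rubric
      value: 'Does the response reach the same final answer as the verified answer ({{answer}})?'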

Current AI Performance

HLE is designed to be extremely challenging. Recent model performance:

  • OpenAI Deep Research: 26.6% accuracy
  • o4-mini: ~13% accuracy
  • DeepSeek-R1: 8.5% accuracy

Low scores are expected - this benchmark represents the cutting edge of AI evaluation.

Customization

Test More Questions

Increase the sample size:

tests:
  - huggingface://datasets/cais/hle?split=test&limit=100

Add More Models

Compare multiple providers:

providers:
  - anthropic:claude-sonnet-4-20250514
  - openai:o4-mini
  - deepseek:deepseek-reasoner
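
Each added provider needs its own credentials. For the DeepSeek entry above, for example, promptfoo expects an API key in the environment (double-check the variable name against the provider docs):

export DEEPSEEK_API_KEY=your_deepseek_key_here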

Different Prompting

Try alternative prompting strategies by modifying prompt.py or using static prompts:

prompts:
  - 'Answer this question step by step: {{question}}'
  - file://prompt.py:create_hle_prompt
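
For reference, a Python prompt function along these lines might look like the sketch below. It assumes promptfoo's Python prompt interface, where the function receives a context dict whose vars entry holds the test variables; the actual create_hle_prompt in prompt.py may build its prompt differently:

def create_hle_prompt(context: dict) -> str:
    # Sketch only: pull the question from the test's variables and wrap it
    # in a simple step-by-step instruction.
    question = context["vars"]["question"]
    return (
        "You are answering a question from Humanity's Last Exam.\n"
        "Reason step by step, then give your final answer on its own line.\n\n"
        f"Question: {question}"
    )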

Resources

  • Complete HLE benchmark guide (linked at the top of this README)
  • HLE dataset on Hugging Face: huggingface.co/datasets/cais/hle