huggingface-hle

Evaluate LLMs against Humanity's Last Exam (HLE), a challenging benchmark created by 1,000+ experts across 500+ institutions. HLE features 3,000+ questions spanning 100+ subjects, designed to push AI capabilities to their limits.

📖 Read the complete HLE benchmark guide →

You can run this example with:

npx promptfoo@latest init --example huggingface-hle

Prerequisites

  • OpenAI API key set as OPENAI_API_KEY
  • Anthropic API key set as ANTHROPIC_API_KEY
  • Hugging Face access token set as HF_TOKEN (required for dataset access)

Setup

Set your Hugging Face token:

export HF_TOKEN=your_token_here

Or add it to your .env file:

HF_TOKEN=your_token_here

Get your token at huggingface.co/settings/tokens.
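
The OpenAI and Anthropic keys from the prerequisites are read from the environment in the same way, so export them alongside the token (placeholder values shown):

export OPENAI_API_KEY=your_openai_key_here
export ANTHROPIC_API_KEY=your_anthropic_key_here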

Run the Evaluation

Run the evaluation:

npx promptfoo@latest eval

View results:

npx promptfoo@latest view
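
To keep a copy of the raw results, eval can also write them to a file, with the format inferred from the extension:

npx promptfoo@latest eval --output results.json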

What's Tested

This evaluation tests models on:

  • Advanced mathematics and sciences
  • Humanities and social sciences
  • Professional domain knowledge
  • Multimodal reasoning
  • Interdisciplinary topics

Each question is evaluated for accuracy using an LLM judge that compares the model's response against the verified correct answer.
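
As a rough sketch, this kind of model-graded check maps onto promptfoo's llm-rubric assertion. The {{answer}} variable below assumes the dataset row exposes the verified answer under that name; the example's actual grading config may differ:

defaultTest:
  assert:
    - type: llm-rubric
      value: 'Does the response reach the same final answer as the verified answer ({{answer}})?'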

Current AI Performance

HLE is designed to be extremely challenging. Recent model performance:

  • OpenAI Deep Research: 26.6% accuracy
  • o4-mini: ~13% accuracy
  • DeepSeek-R1: 8.5% accuracy

Low scores are expected - this benchmark represents the cutting edge of AI evaluation.

Customization

Test More Questions

Increase the sample size:

tests:
  - huggingface://datasets/cais/hle?split=test&limit=100

Add More Models

Compare multiple providers:

providers:
  - anthropic:claude-sonnet-4-20250514
  - openai:o4-mini
  - deepseek:deepseek-reasoner
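
Each added provider needs its own credentials. For the DeepSeek entry above, for example, promptfoo expects an API key in the environment (double-check the variable name against the provider docs):

export DEEPSEEK_API_KEY=your_deepseek_key_here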

Different Prompting

Try alternative prompting strategies by modifying prompt.py or using static prompts:

prompts:
  - 'Answer this question step by step: {{question}}'
  - file://prompt.py:create_hle_prompt
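
For reference, a Python prompt function along these lines might look like the sketch below. It assumes promptfoo's Python prompt interface, where the function receives a context dict whose vars entry holds the test variables; the actual create_hle_prompt in prompt.py may build its prompt differently:

def create_hle_prompt(context: dict) -> str:
    # Sketch only: pull the question from the test's variables and wrap it
    # in a simple step-by-step instruction.
    question = context["vars"]["question"]
    return (
        "You are answering a question from Humanity's Last Exam.\n"
        "Reason step by step, then give your final answer on its own line.\n\n"
        f"Question: {question}"
    )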

Resources

  • Complete HLE benchmark guide (linked at the top of this README)
  • HLE dataset on Hugging Face: huggingface.co/datasets/cais/hle