# deepseek-r1-vs-openai-o1 (DeepSeek-R1 vs OpenAI o1 Comparison)

You can run this example with:

```sh
npx promptfoo@latest init --example deepseek-r1-vs-openai-o1
```

This example demonstrates how to benchmark DeepSeek's R1 model against OpenAI's o1 model using the Massive Multitask Language Understanding (MMLU) benchmark, focusing on reasoning-heavy subjects.

## Prerequisites

- promptfoo CLI installed (`npm install -g promptfoo` or `brew install promptfoo`)
- OpenAI API key set as `OPENAI_API_KEY`
- DeepSeek API key set as `DEEPSEEK_API_KEY`
- Hugging Face account and access token (for the MMLU dataset)
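Both provider keys must be visible to the shell that runs the eval. A minimal sketch, with placeholder values you would replace with your real keys:

```shell
# Placeholder values -- substitute your actual keys
export OPENAI_API_KEY=your_openai_key_here
export DEEPSEEK_API_KEY=your_deepseek_key_here

# Quick sanity check that both are set
[ -n "$OPENAI_API_KEY" ] && [ -n "$DEEPSEEK_API_KEY" ] && echo "provider keys set"
```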

## Hugging Face Authentication

To access the MMLU dataset, you'll need to authenticate with Hugging Face:

1. Create a Hugging Face account at [huggingface.co](https://huggingface.co) if you don't have one.

2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).

3. Set your token as an environment variable:

   ```sh
   export HF_TOKEN=your_token_here
   ```

   Or add it to your `.env` file:

   ```
   HF_TOKEN=your_token_here
   ```
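To confirm the token is actually visible to the process before kicking off a run, a small Python sketch (not part of the example itself) that only inspects the environment:

```python
import os

def check_hf_token(env):
    """Return 'set' if an HF_TOKEN entry is present and non-empty, else 'missing'."""
    return "set" if env.get("HF_TOKEN") else "missing"

# Check the real environment, plus a hypothetical populated one
print(check_hf_token(os.environ))
print(check_hf_token({"HF_TOKEN": "hf_example"}))  # prints "set"
```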

## Running the Eval

1. Get a local copy of the example's `promptfooconfig.yaml`.

   Clone this repository and, from its root directory, run:

   ```sh
   cd examples/deepseek-r1-vs-openai-o1
   ```

   Or fetch the example directly:

   ```sh
   promptfoo init --example deepseek-r1-vs-openai-o1
   ```

2. Run the evaluation:

   ```sh
   promptfoo eval
   ```

3. View the results in a web interface:

   ```sh
   promptfoo view
   ```

## What's Being Tested

This comparison evaluates both models on reasoning tasks from the MMLU benchmark, specifically:

1. **Abstract Algebra**: advanced mathematical reasoning
2. **Formal Logic**: logical statement analysis
3. **High School Mathematics**: core problem-solving
4. **College Mathematics**: advanced mathematical concepts
5. **Logical Fallacies**: flaw identification in reasoning

Each subject uses 10 questions to keep the test manageable. You can adjust this in `promptfooconfig.yaml`.
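A quick back-of-the-envelope on eval size, assuming one model call per question per provider:

```python
subjects = [
    "abstract_algebra",
    "formal_logic",
    "high_school_mathematics",
    "college_mathematics",
    "logical_fallacies",
]
questions_per_subject = 10   # the default in this example
models = 2                   # DeepSeek-R1 and OpenAI o1

# Total model calls for one full run
total_calls = len(subjects) * questions_per_subject * models
print(total_calls)  # 100
```

Doubling `limit` per subject doubles both runtime and API cost, so it's worth keeping this number in mind before scaling up.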

## Test Structure

The configuration in `promptfooconfig.yaml` defines:

1. **Prompt Template**: encourages step-by-step reasoning for multiple-choice questions
2. **Quality Checks**:
   - 60-second timeout per question
   - required step-by-step reasoning
   - clear final answer format
3. **Evaluation Metrics**:
   - accuracy
   - reasoning quality
   - response time
   - format adherence
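Those quality checks map onto promptfoo assertions. A hedged sketch of what such a `defaultTest` block could look like (the example's actual `promptfooconfig.yaml` is authoritative; the rubric wording here is illustrative):

```yaml
defaultTest:
  assert:
    # 60-second timeout per question
    - type: latency
      threshold: 60000
    # Clear final answer format
    - type: regex
      value: 'Therefore, the answer is [ABCD]'
    # Required step-by-step reasoning, graded by a model
    - type: llm-rubric
      value: The response reasons through the problem step by step before answering.
```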

## Customizing

You can modify the test by editing `promptfooconfig.yaml`:

1. Add more MMLU subjects:

   ```yaml
   tests:
     - huggingface://datasets/cais/mmlu?split=test&subset=physics
   ```

2. Try different prompting strategies:

   ```yaml
   prompts:
     # Zero-shot with step-by-step reasoning (default)
     - |
       You are an expert test taker. Please solve the following multiple choice question step by step.

       Question: {{question}}

       Options:
       A) {{choices[0]}}
       B) {{choices[1]}}
       C) {{choices[2]}}
       D) {{choices[3]}}

       Think through this step by step, then provide your final answer in the format "Therefore, the answer is A/B/C/D."

     # Zero-shot with direct answer
     - |
       Question: {{question}}

       A) {{choices[0]}}
       B) {{choices[1]}}
       C) {{choices[2]}}
       D) {{choices[3]}}

       Answer with just the letter (A/B/C/D) of the correct option.
   ```

3. Change the number of questions:

   ```yaml
   tests:
     - huggingface://datasets/cais/mmlu?split=test&subset=physics&limit=20 # Test 20 questions per subject
   ```

4. Adjust quality requirements:

   ```yaml
   defaultTest:
     assert:
       - type: latency
         threshold: 30000 # Stricter 30-second timeout
   ```
