This example measures the political bias of Grok 4 compared to other major AI models using a comprehensive dataset of 2,500 political opinion questions, including specific questions designed to detect corporate bias in AI responses.
📖 Read the full analysis: Grok 4 Goes Red? Yes, But Not How You Think
You can run this example with:

```bash
npx promptfoo@latest init --example grok-4-political-bias
```
This example requires the following environment variables:

- `XAI_API_KEY` - Your xAI API key for Grok 4
- `GOOGLE_API_KEY` - Your Google API key for Gemini 2.5 Pro
- `OPENAI_API_KEY` - Your OpenAI API key for GPT-4.1
- `ANTHROPIC_API_KEY` - Your Anthropic API key for Claude Opus 4

You can set these in a `.env` file or directly in your environment:
```bash
export XAI_API_KEY="your_xai_api_key"
export GOOGLE_API_KEY="your_google_api_key"
export OPENAI_API_KEY="your_openai_api_key"
export ANTHROPIC_API_KEY="your_anthropic_api_key"
```
```bash
# Full evaluation with all models
npx promptfoo@latest eval -c promptfooconfig.yaml --output results.json

# Multi-judge analysis (4 models × 4 judges)
npx promptfoo@latest eval -c promptfooconfig-multi-judge.yaml --output results-multi-judge.json

# View results in the web UI
npx promptfoo@latest view

# Generate analysis charts
python analyze_results_multi_judge.py
python generate_political_spectrum_chart.py
```
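If you want a quick sanity check before running the full analysis scripts, a minimal sketch like the one below pulls per-model average scores out of `results.json`. The exact JSON layout depends on your promptfoo version; the field names used here (`results.results`, `provider`, `score`) are assumptions to adapt as needed.

```python
# sanity_check.py - hypothetical helper, not part of the example's scripts.
# Assumes promptfoo's JSON output exposes a results.results list whose entries
# carry a provider identifier and a numeric score; adjust the field names if
# your promptfoo version structures the file differently.
import json
from collections import defaultdict

with open("results.json") as f:
    data = json.load(f)

scores = defaultdict(list)
for row in data.get("results", {}).get("results", []):
    provider = row.get("provider")
    # provider may be a plain string or an object like {"id": "...", "label": "..."}
    name = (provider.get("id") if isinstance(provider, dict) else provider) or "unknown"
    if row.get("score") is not None:
        scores[name].append(row["score"])

for name, values in sorted(scores.items()):
    print(f"{name}: mean score {sum(values) / len(values):.3f} over {len(values)} responses")
```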
The experiment's key findings are covered in the full analysis linked above. The example includes the following files:
- `political-questions.csv` - 2,500 political questions, each tagged with a category (e.g., economic, social)
- `promptfooconfig.yaml` - Main configuration for the basic evaluation
- `promptfooconfig-multi-judge.yaml` - Multi-judge configuration (4×4 matrix)
- `political-bias-rubric.yaml` - 7-point Likert scale rubric for political scoring
- `analyze_results_multi_judge.py` - Comprehensive multi-judge analysis
- `generate_political_spectrum_chart.py` - Creates political positioning visualizations
- `analyze_corporate_bias.py` - Analyzes responses to Musk-related questions
- `extract_high_variance.py` - Finds the questions with the highest judge disagreement
- `results.json` - Full evaluation results from the basic experiment
- `results-multi-judge.json` - Results from the multi-judge analysis (4×4 matrix)

Each model response is scored on a 0-1 scale.
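The authoritative scoring definition lives in `political-bias-rubric.yaml`. As a hedged illustration only, a 7-point Likert rating can be normalized onto a 0-1 scale like this; the endpoints and direction of the scale below are assumptions, not the project's exact rubric.

```python
# Hypothetical normalization: map a 7-point Likert rating onto the 0-1 scale.
# The real rubric is defined in political-bias-rubric.yaml; which end of the
# scale corresponds to "left" vs. "right" is an assumption here.

def likert_to_unit_interval(rating: int) -> float:
    """Map a rating in [1, 7] onto [0.0, 1.0], with 4 landing at 0.5."""
    if not 1 <= rating <= 7:
        raise ValueError(f"expected a Likert rating between 1 and 7, got {rating}")
    return (rating - 1) / 6

# Example: a neutral rating of 4 maps to 0.5.
assert likert_to_unit_interval(4) == 0.5
```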
The analysis covers multi-judge scoring, political-spectrum positioning, corporate bias on the Musk-related questions, and the questions with the highest judge disagreement.
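For intuition on the judge-disagreement step (handled properly by `extract_high_variance.py`), the sketch below ranks questions by the variance of their judge scores. The input shape and the numbers are purely illustrative assumptions, not the actual `results-multi-judge.json` layout.

```python
# Hypothetical sketch of the idea behind extract_high_variance.py: given each
# question's scores from the four judges, rank questions by score variance.
from statistics import pvariance

judge_scores = {
    # question text -> scores assigned by each judge (illustrative numbers only)
    "Should the minimum wage be raised?": [0.2, 0.3, 0.8, 0.7],
    "Should income taxes be flatter?": [0.5, 0.5, 0.6, 0.5],
}

ranked = sorted(judge_scores.items(), key=lambda kv: pvariance(kv[1]), reverse=True)
for question, scores in ranked:
    print(f"variance={pvariance(scores):.3f}  {question}")
```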
Running the full experiment sends all 2,500 questions to each of the four models (plus the judge calls), so expect it to take a while and use a substantial number of API requests.
For testing with smaller samples:
```bash
# Test with 100 questions (head -101 keeps the CSV header row plus 100 rows)
head -101 political-questions.csv > test-100.csv

# Test economic questions only (copy the header row first so the file stays a valid test CSV)
head -1 political-questions.csv > economic-only.csv
grep ",economic$" political-questions.csv >> economic-only.csv

# Test social questions only
head -1 political-questions.csv > social-only.csv
grep ",social$" political-questions.csv >> social-only.csv

# Use rate limiting
npx promptfoo@latest eval -c promptfooconfig.yaml --max-concurrency 5
```
Edit the configuration files to adjust which models are compared, which judges score the responses, and which question file is used.