title	sidebar_position	description	tags	keywords
Release Notes	100	Track monthly Promptfoo releases featuring new providers, security plugins, performance improvements, and community contributions	[releases changelog updates features]	[Promptfoo releases changelog updates features monthly summaries]

Release Notes

Full release history for Promptfoo open source can be found on GitHub.

July 2025 Release Highlights

This month we focused on expanding provider support, enhancing evaluation capabilities, and strengthening enterprise features to help you build more reliable and secure AI applications.

Evals

New Models / Providers

Expanded Provider Support

Docker Model Runner - Run models in isolated Docker containers for better security and reproducibility
MCP (Model Context Protocol) - Connect to MCP servers for enhanced AI capabilities
Google Imagen - Generate images for multimodal testing scenarios
AIMLAPI - Access various AI models through a unified interface

New Model Support

Grok-4 - Advanced reasoning capabilities from xAI
OpenAI Deep Research Models - o3-deep-research and o4-mini-deep-research for complex problem solving
Enhanced Azure Provider - Added system prompt support for better control

Enhanced Capabilities

LiteLLM Embeddings - Similarity testing and semantic search
Google Vision - Image understanding for multimodal evaluations
HTTP Provider Enhancements - Support for JKS and PFX client certificates
Browser Provider - Connect to existing Chrome browser sessions via Chrome DevTools Protocol (CDP) for testing OAuth-authenticated applications

Assertion Improvements

Context Transforms: Extract additional data from provider responses for use in context-based assertions. These are especially useful for evaluating RAG systems.
Finish Reason Validation: Use the finish-reason assertion to validate how a model response terminated. This is useful for checking whether the model completed naturally, hit a token limit, triggered a content filter, or made tool calls as expected.
Tracing Assertions: Use your tracing and telemetry data in assertions with trace-span-count, trace-span-duration, and trace-error-spans.
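As a rough illustration of what finish-reason validation checks, the sketch below reads the finish reason from an OpenAI-style chat completion payload and compares it with an expected value. The helper name `check_finish_reason` and the sample payloads are hypothetical. This is not Promptfoo's implementation, only the idea behind the assertion.

```python
from typing import Any, Dict

# Finish reasons used by OpenAI-style chat completion responses.
KNOWN_FINISH_REASONS = {"stop", "length", "content_filter", "tool_calls"}


def check_finish_reason(response: Dict[str, Any], expected: str) -> bool:
    """Return True when the first choice ended with the expected finish reason.

    Hypothetical helper for illustration; not part of Promptfoo.
    """
    if expected not in KNOWN_FINISH_REASONS:
        raise ValueError(f"unknown finish reason: {expected!r}")
    choices = response.get("choices") or []
    if not choices:
        return False
    return choices[0].get("finish_reason") == expected


# Sample payloads (made up for this example).
completed = {"choices": [{"finish_reason": "stop", "message": {"content": "Done."}}]}
truncated = {"choices": [{"finish_reason": "length", "message": {"content": "Partial"}}]}

print(check_finish_reason(completed, "stop"))   # True
print(check_finish_reason(truncated, "stop"))   # False
```

A check like this catches silently truncated answers: the second payload has content, but it stopped because of a token limit rather than a natural end.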

Other Features

External Test Configuration - defaultTest can now load test cases from external files for easier management

Developer Experience Improvements

Python Debugging - Use import pdb; pdb.set_trace() in third-party Python scripts executed by Promptfoo for easier debugging
Enhanced Search - Added metadata filtering to search results, with search operators (equals, contains, not contains) and persistent button actions
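A minimal sketch of where a breakpoint might go in a Python provider script. It assumes the `call_api(prompt, options, context)` entry point used by Promptfoo's Python provider, so check the signature against your installed version. The `PROVIDER_DEBUG` environment variable and the echo logic are hypothetical, added only so the breakpoint stays off by default.

```python
import os
import pdb
from typing import Any, Dict


def call_api(prompt: str, options: Dict[str, Any], context: Dict[str, Any]) -> Dict[str, Any]:
    """Entry point for a Python provider script.

    The echo behaviour and the PROVIDER_DEBUG switch are hypothetical.
    """
    if os.environ.get("PROVIDER_DEBUG") == "1":
        # Pauses here so you can inspect `prompt`, `options`, and `context`.
        pdb.set_trace()

    output = f"echo: {prompt}"
    return {"output": output}
```

Guarding the breakpoint behind an environment variable keeps unattended runs from blocking on an interactive prompt.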

Web UI Improvements

Enhanced Eval Results Page

We've significantly improved the evaluation results interface to handle large-scale testing more effectively:

First-Class Zooming Support - Zoom in and out of the eval results table to see more data at once or focus on specific details. This is especially useful when working with evaluations containing hundreds or thousands of test cases.
Advanced Metadata Filtering - Filter results using powerful search operators (equals, contains, not contains) with persistent button actions. Click on any metric pill in the results to instantly apply it as a filter, making it easier to drill down into specific failure modes or success patterns.
Improved Pagination - Enhanced pagination controls with "go to" functionality and better handling of large result sets. The UI now maintains scroll position and filter state as you navigate between pages.
Multi-Metric Filtering - Apply multiple filters simultaneously to find exactly the results you're looking for. For red team evaluations, you can now filter by both plugin and strategy to analyze specific attack vectors.
Performance Optimizations - Fixed horizontal scrolling issues, improved rendering performance for large tables, and optimized memory usage when dealing with extensive evaluation results.

These improvements make it much easier to analyze and understand evaluation results, especially for large-scale red teaming exercises or comprehensive test suites.
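The filter operators described above (equals, contains, not contains) can be sketched in a few lines. This illustrates the semantics only, not the Web UI's code: the row shape, the metadata keys, and the choice to combine multiple filters with AND are all assumptions.

```python
from typing import Any, Dict, List, Tuple

Filter = Tuple[str, str, str]  # (metadata key, operator, value)


def matches(metadata: Dict[str, Any], key: str, operator: str, value: str) -> bool:
    """Apply one operator to one metadata field (missing fields count as empty)."""
    actual = str(metadata.get(key, ""))
    if operator == "equals":
        return actual == value
    if operator == "contains":
        return value in actual
    if operator == "not contains":
        return value not in actual
    raise ValueError(f"unsupported operator: {operator!r}")


def filter_results(results: List[Dict[str, Any]], filters: List[Filter]) -> List[Dict[str, Any]]:
    """Keep results whose metadata satisfies every filter (AND semantics, assumed)."""
    return [
        row
        for row in results
        if all(matches(row.get("metadata", {}), key, op, value) for key, op, value in filters)
    ]


# Made-up rows for illustration.
results = [
    {"id": 1, "metadata": {"plugin": "pii", "strategy": "jailbreak"}},
    {"id": 2, "metadata": {"plugin": "pii", "strategy": "base64"}},
    {"id": 3, "metadata": {"plugin": "harmful:hate", "strategy": "jailbreak"}},
]

selected = filter_results(
    results,
    [("plugin", "equals", "pii"), ("strategy", "not contains", "base64")],
)
print([row["id"] for row in selected])  # [1]
```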

Red Teaming

Enterprise Features

Regrade Red Team Scans - After changing grading rules (pass/fail criteria, reasoning, etc.), regrade existing scan results without re-running them to measure the effect of those changes.
Identity Provider Integration - Map teams and roles from your Identity Provider to automatically assign permissions
MCP Proxy - Enterprise-grade security for MCP servers with access control and traffic monitoring for sensitive data

Strategies

New Agentic Multi-Turn Strategies

We've launched two powerful new agentic multi-turn red team strategies that adapt dynamically based on target responses:

Custom Strategy - Define your own red teaming strategies using natural language instructions. This lets you create domain-specific attack patterns without writing code. The AI agent interprets your instructions and executes multi-turn conversations tailored to your specific testing needs.
Mischievous User Strategy - Simulates an innocently mischievous user who plays subtle games with your AI agent through multi-turn conversations. This strategy uncovers vulnerabilities by mimicking real-world user behavior where users might push boundaries through playful or indirect approaches rather than direct attacks.

Both strategies leverage AI agents to conduct intelligent, adaptive conversations that evolve based on your system's responses, making them far more effective than static attack patterns.

Other Strategy Improvements

HTTP Target Improvements - Enhanced test button now provides detailed error diagnostics, automatic retry suggestions, and context-aware fixes for common configuration issues like authentication failures, CORS errors, and malformed requests

June 2025 Release Highlights

This month we focused on enhancing observability, expanding provider support, and strengthening red team capabilities to help you build more reliable and secure AI applications.

Evals

Tracing

See Inside Your LLM Applications with OpenTelemetry

We've added OpenTelemetry tracing support to help you understand what's happening inside your AI applications. Previously, LLM applications were often "black boxes"—you could see inputs and outputs, but not what happened in between. Now you can visualize the entire execution flow, measure performance of individual steps, and quickly identify issues.

This is especially valuable for complex RAG pipelines or multi-step workflows where you need to identify performance bottlenecks or debug failures.

Use it when:

Debugging slow RAG pipelines
Optimizing multi-step agent workflows
Understanding why certain requests fail
Measuring performance across different providers
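To make the idea of spans concrete, here is a standard-library-only sketch that times nested steps of a pipeline and records whether each one failed. It is not the OpenTelemetry API and not how Promptfoo collects traces. The `span` helper, the span names, and the sleep calls standing in for real work are all hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Finished spans, appended in completion order.
spans: List[Dict[str, object]] = []


@contextmanager
def span(name: str) -> Iterator[None]:
    """Record the duration and error status of the wrapped block."""
    start = time.perf_counter()
    error = False
    try:
        yield
    except Exception:
        error = True
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        spans.append({"name": name, "duration_ms": duration_ms, "error": error})


# Stand-in for a RAG request: retrieval followed by generation.
with span("rag.pipeline"):
    with span("rag.retrieve"):
        time.sleep(0.02)
    with span("llm.generate"):
        time.sleep(0.01)

for record in spans:
    print(f"{record['name']}: {record['duration_ms']:.1f} ms, error={record['error']}")
```

Inner spans finish first, so they appear before the enclosing pipeline span. Comparing their durations shows which step dominates the total.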

New Models / Providers

Expanded Audio and Multimodal Capabilities

As AI applications increasingly use voice interfaces and visual content, you need tools to evaluate these capabilities just as rigorously as text-based interactions. We've significantly expanded support for audio and multimodal AI:

Google Live Audio - Full audio generation with features like:
- Voice selection and customization
- Affective dialog for more natural conversations
- Real-time transcription
- Support for Gemini 2.0 Flash and native audio models
Hyperbolic Provider - New support for Hyperbolic's image and audio models, providing more options for multimodal evaluations
Helicone AI Gateway - Route requests through Helicone for enhanced monitoring and analytics
Mistral Magistral - Added support for Mistral's latest reasoning models

Other Features

Static Model Scanning with ModelAudit

Supply chain attacks through compromised models are a growing threat. We've significantly enhanced our static model security scanner to help you verify model integrity before deployment, checking for everything from malicious pickle files to subtle statistical anomalies that might indicate trojaned models.

New Web Interface: ModelAudit now includes a visual UI accessible at /model-audit when running promptfoo view:

Visual file/directory selection with drag-and-drop support
Real-time scanning progress with live updates
Tabbed results display with severity color coding
Scan history tracking

Expanded Format Support:

SafeTensors - Support for Hugging Face's secure tensor format
HuggingFace URLs - Scan models directly from HuggingFace without downloading
Enhanced Binary Detection - Automatic format detection for .bin files (PyTorch, SafeTensors, etc.)
Weight Analysis - Statistical anomaly detection to identify potential backdoors

Security Improvements:

Better detection of embedded executables (Windows PE, Linux ELF, macOS Mach-O)
Path traversal protection in archives
License compliance checking with SBOM generation
Protection against zip bombs and decompression attacks
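Two of the archive checks listed above can be sketched with Python's `zipfile` module. This is an illustration of the general techniques, not ModelAudit's implementation: the `audit_zip` helper and the 100:1 ratio threshold are hypothetical, and the sketch does not cover other cases such as Windows drive prefixes.

```python
import io
import posixpath
import zipfile
from typing import List

MAX_COMPRESSION_RATIO = 100  # hypothetical threshold for this sketch


def is_path_traversal(name: str) -> bool:
    """True when an archive member would resolve outside the extraction directory."""
    normalized = posixpath.normpath(name.replace("\\", "/"))
    return normalized.startswith("/") or normalized == ".." or normalized.startswith("../")


def audit_zip(data: bytes) -> List[str]:
    """Inspect member metadata only; nothing is extracted to disk."""
    findings: List[str] = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if is_path_traversal(info.filename):
                findings.append(f"path traversal: {info.filename}")
            if info.compress_size > 0 and info.file_size / info.compress_size > MAX_COMPRESSION_RATIO:
                findings.append(f"suspicious compression ratio: {info.filename}")
    return findings


# Build a small in-memory archive with one benign and two suspicious members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("weights/model.bin", b"\x01\x02\x03\x04")
    archive.writestr("../outside.txt", b"escape")
    archive.writestr("padding.bin", b"\x00" * 1_000_000)

findings = audit_zip(buffer.getvalue())
print(findings)
```

Reading sizes from the archive's metadata lets a scanner reject a decompression bomb before inflating any of it.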

Developer Experience Improvements

Assertion Generation - Automatically generate test assertions based on your use cases, saving time in test creation
SQLite WAL Mode - Improved performance and reliability for local evaluations with better concurrent access
Enhanced Token Tracking - Per-provider token usage statistics help you monitor costs across different LLM providers
Evaluation Time Limits - New PROMPTFOO_MAX_EVAL_TIME_MS environment variable prevents runaway evaluations from consuming excessive resources
Custom Headers Support - Added support for custom headers in Azure and Google Gemini providers for enterprise authentication needs
WebSocket Header Support - Enhanced WebSocket providers with custom header capabilities
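One way to apply the evaluation time limit from a script is to set `PROMPTFOO_MAX_EVAL_TIME_MS` in the environment of the process that runs the eval. The sketch below only builds the command and environment. The `build_eval_invocation` helper is hypothetical, and the config path and five-minute limit are placeholder values.

```python
import os
from typing import Dict, List, Tuple


def build_eval_invocation(
    max_eval_time_ms: int,
    config_path: str = "promptfooconfig.yaml",
) -> Tuple[List[str], Dict[str, str]]:
    """Return the CLI command and environment for a time-limited eval run."""
    if max_eval_time_ms <= 0:
        raise ValueError("max_eval_time_ms must be positive")
    env = dict(os.environ)
    # Environment variables are strings, so the limit is passed as text.
    env["PROMPTFOO_MAX_EVAL_TIME_MS"] = str(max_eval_time_ms)
    command = ["promptfoo", "eval", "-c", config_path]
    return command, env


command, env = build_eval_invocation(5 * 60 * 1000)  # five minutes
print(command, env["PROMPTFOO_MAX_EVAL_TIME_MS"])

# Where the promptfoo CLI is installed, the run could then be started with:
#   import subprocess
#   subprocess.run(command, env=env, check=True)
```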

Red Teaming

Enterprise Features

Advanced Testing Capabilities for Teams

Generic attacks often miss system-specific vulnerabilities. For organizations that need more sophisticated AI security testing, we've added features for creating targeted tests that match your actual security risks:

Target Discovery Agent - Automatically analyzes your AI system to understand its capabilities and craft more effective, targeted attacks
Adaptive Red Team Strategies - Define complex multi-turn attack strategies with enhanced capabilities for targeted testing
Grader Customization - Fine-tune evaluation criteria at the plugin level with concrete examples for more accurate assessments
Cloud-based Plugin Severity Overrides - Enterprise users can centrally manage and customize severity levels for red team plugins across their organization

Plugins

Comprehensive Safety Testing for High-Stakes Domains

Different industries face unique AI risks. We've introduced specialized plugins for industries where AI errors have serious consequences, ensuring you're testing for the failures that matter most in your domain:

Medical Safety Testing

Medical Plugins detect critical healthcare risks:

Hallucination - Fabricated medical studies or drug interactions
Prioritization Errors - Dangerous mistakes in triage scenarios
Anchoring Bias - Fixation on initial symptoms while ignoring critical information
Sycophancy - Agreeing with incorrect medical assumptions from users

Financial Risk Detection

Financial Plugins identify domain-specific vulnerabilities:

Calculation Errors - Mistakes in financial computations
Compliance Violations - Regulatory breaches in advice or operations
Data Leakage - Exposure of confidential financial information
Hallucination - Fabricated market data or investment advice

Bias Detection Suite

Biased AI systems can perpetuate discrimination at scale. Our new bias detection tests help you check that your AI treats all users fairly and respectfully across:

Age - Ageism in hiring, healthcare, or service recommendations
Disability - Unfair assumptions about capabilities
Gender - Role stereotypes and differential treatment
Race - Ethnic stereotypes and discriminatory patterns

Enterprise-Grade Datasets

Aegis Dataset - NVIDIA's 26,000+ manually annotated interactions across 13 safety categories for comprehensive content safety testing

New Red Team Capabilities

Intent Plugin Enhancements

The Intent (Custom Prompts) plugin now supports JSON file uploads with nested arrays for multi-step attack sequences. The enhanced UI makes it easier to manage complex test scenarios.
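A sketch of how such a file might be generated, assuming a top-level JSON array in which a plain string is a single prompt and a nested array is an ordered multi-step sequence. The exact schema the plugin accepts is an assumption here, so check the plugin documentation before uploading. `validate_intents` and the placeholder strings are hypothetical.

```python
import json
from typing import List, Union

Intent = Union[str, List[str]]


def validate_intents(intents: List[Intent]) -> None:
    """Accept strings (single prompts) and non-empty lists of strings (sequences)."""
    for entry in intents:
        if isinstance(entry, str):
            continue
        if isinstance(entry, list) and entry and all(isinstance(step, str) for step in entry):
            continue
        raise ValueError(f"invalid intent entry: {entry!r}")


# Placeholder content; replace with prompts relevant to your own system.
intents: List[Intent] = [
    "single-turn prompt placeholder",
    ["step one placeholder", "step two placeholder", "step three placeholder"],
]

validate_intents(intents)
payload = json.dumps(intents, indent=2)
print(payload)
```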

Enhanced HTTP Provider Support

Red team tests now include automatic token estimation for HTTP providers, helping you track costs even with custom API integrations.

System Prompt Override Testing

A new System Prompt Override plugin tests whether your LLM deployment is vulnerable to system instruction manipulation—a critical security flaw that could disable safety features.

nirbarazida / promptfoo mirror of https://github.com/promptfoo/promptfoo
