This document outlines the plan to integrate Verdict as a model-graded metric in Promptfoo. Verdict is a declarative framework for building compound LLM-as-a-judge systems that scales up "judge-time compute" to create more reliable evaluations.
Verdict addresses the unreliability of simple LLM judges by composing multiple rounds of reasoning, verification, and aggregation. Instead of a single LLM call, Verdict synthesizes multiple units of evaluation to produce more robust judgments.
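At a high level, a Verdict pipeline layers judge units, verification units, and an aggregation unit: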
```mermaid
graph TD
    A[Input Data] --> B[Verdict Pipeline]
    B --> C[Layer 1: Judge Units]
    C --> D[Layer 2: Verification Units]
    D --> E[Layer 3: Aggregation Unit]
    E --> F[Final Score/Decision]

    C1[Judge 1] --> D1[Verify 1]
    C2[Judge 2] --> D2[Verify 2]
    C3[Judge 3] --> D3[Verify 3]
    D1 --> E
    D2 --> E
    D3 --> E

    subgraph "Layer 1"
        C1
        C2
        C3
    end
    subgraph "Layer 2"
        D1
        D2
        D3
    end
    subgraph "Layer 3"
        E[MaxPoolUnit]
    end
```
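Each unit follows the same internal execution lifecycle: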
```mermaid
sequenceDiagram
    participant U as Unit
    participant E as Extractor
    participant LLM as LLM Provider
    participant V as Validator

    U->>U: Prepare Input Schema
    U->>U: Format Prompt
    U->>E: Extract Response
    E->>LLM: API Call
    LLM-->>E: Raw Response
    E-->>U: Parsed Response
    U->>V: Validate Response
    V-->>U: Validation Result
    U->>U: Process & Propagate
    U-->>U: Output Schema
```
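Units can be wired together within and across layers using several connection patterns: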
```mermaid
graph LR
    subgraph "CHAIN Pattern"
        A1[Unit A] --> B1[Unit B] --> C1[Unit C]
    end
    subgraph "DENSE Pattern"
        A2[Unit A] --> C2[Unit C]
        A2 --> D2[Unit D]
        B2[Unit B] --> C2
        B2 --> D2
    end
    subgraph "BROADCAST Pattern"
        A3[Unit A] --> C3[Unit C]
        B3[Unit B] --> D3[Unit D]
    end
    subgraph "CUMULATIVE Pattern"
        A4[Unit A] --> C4[Unit C]
        A4 --> D4[Unit D]
        B4[Unit B] --> D4
    end
```
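In Verdict's Python API, a single judge unit with a discrete 1-5 scale looks like this: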
```python
judge = (
    JudgeUnit(DiscreteScale((1, 5)))
    .prompt("Evaluate the politeness...")
    .via("gpt-4o-mini", retries=3, temperature=0.4)
)
```
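Units compose into pipelines with the `>>` operator, e.g. a judge whose explanation is then checked by a verifier: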
```python
pipeline = (
    Pipeline()
    >> JudgeUnit(explanation=True)  # Initial judgment with explanation
    >> VerifyUnit()                 # Verify the explanation is sound
)
```
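Layers replicate a sub-pipeline for ensembling, with an aggregation unit collecting the results: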
```python
pipeline = (
    Pipeline()
    >> Layer(
        JudgeUnit() >> VerifyUnit(),  # Create judge-verify pairs
        repeat=3,                     # Repeat 3 times
    )
    >> MaxPoolUnit()  # Vote for the most common answer
)
```
Three design areas for the TypeScript port:

- Type System (see the type definitions below)
- Async/Await Pattern (see the execution sketch after the type definitions)
- Configuration (see the configuration schema later in this document)
```typescript
// Scale types
interface Scale<T> {
  values: T[];
  validate(value: T): boolean;
  serialize(): string;
}

class DiscreteScale implements Scale<string> {
  constructor(public values: string[]) {}
  validate(value: string): boolean {
    return this.values.includes(value);
  }
  serialize(): string {
    return this.values.join(' | ');
  }
}

class LikertScale implements Scale<number> {
  values: number[];
  constructor(
    public min: number,
    public max: number,
  ) {
    // Enumerate the integer points on the scale, e.g. [1, 2, 3, 4, 5]
    this.values = Array.from({ length: max - min + 1 }, (_, i) => min + i);
  }
  validate(value: number): boolean {
    return Number.isInteger(value) && value >= this.min && value <= this.max;
  }
  serialize(): string {
    return `${this.min}-${this.max}`;
  }
}

// Unit types
interface Unit<TInput, TOutput> {
  execute(input: TInput): Promise<TOutput>;
  validate(input: TInput, response: any): void;
  process(input: TInput, response: any): TOutput;
}

// Pipeline types
interface Pipeline {
  units: Unit<any, any>[];
  execute(input: any): Promise<any>;
}
```
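For the async/await pattern, a minimal sketch: layers run sequentially while the units inside a layer fan out in parallel. The `SequentialPipeline` class and `executeLayer` helper below are illustrative names, not part of the plan above.

```typescript
// Illustrative sketch of the async execution pattern (names are hypothetical).
class SequentialPipeline implements Pipeline {
  constructor(public units: Unit<any, any>[]) {}

  // Layers run one after another: each unit awaits the previous output.
  async execute(input: any): Promise<any> {
    let current = input;
    for (const unit of this.units) {
      current = await unit.execute(current);
    }
    return current;
  }
}

// Units inside a layer (e.g. three judge-verify chains) can run concurrently.
async function executeLayer(units: Unit<any, any>[], input: any): Promise<any[]> {
  return Promise.all(units.map((unit) => unit.execute(input)));
}
```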
### Create TypeScript Types

### Implement Base Classes
- `VerdictUnit` - Base class for all evaluation units
- `VerdictLayer` - Container for multiple units
- `VerdictPipeline` - Main execution orchestrator

**Judge Units**

- `DirectScoreJudgeUnit` - Direct numerical scoring
- `CategoricalJudgeUnit` - Yes/No or categorical decisions
- `PairwiseJudgeUnit` - Compare two options

**Aggregation Units**

- `MaxPoolUnit` - Majority voting
- `MeanPoolUnit` - Average scoring
- `WeightedPoolUnit` - Weighted aggregation

**Verification Units**

- `VerifyUnit` - Verify explanations are valid
- `ConsistencyUnit` - Check for consistent reasoning

### Create Assertion Handler
```typescript
// src/assertions/verdict.ts
export const handleVerdict = async ({
  assertion,
  outputString,
  test,
}: AssertionParams): Promise<GradingResult>
```
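A minimal handler body might look like the sketch below; the `buildPipeline` helper and the shape of the pipeline result are assumptions for illustration, not the final implementation:

```typescript
// Hypothetical sketch; `buildPipeline` is an assumed helper, not an existing API.
export const handleVerdict = async ({
  assertion,
  outputString,
  test,
}: AssertionParams): Promise<GradingResult> => {
  // Build the unit graph from the assertion's `value` config (hypothetical helper).
  const pipeline = buildPipeline(assertion.value);

  // Run the compound judge over the model output and test variables.
  const { score, explanation } = await pipeline.execute({
    output: outputString,
    vars: test.vars,
  });

  const threshold = assertion.threshold ?? 0.5;
  return {
    pass: score >= threshold,
    score,
    reason: explanation ?? `Verdict pipeline score: ${score}`,
    assertion,
  };
};
```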
### Register Assertion Type
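The new type would be wired into Promptfoo's assertion dispatch; the handler map below is an assumption about how that dispatch is organized (the actual codebase may use a switch or a different map):

```typescript
// Hypothetical wiring in src/assertions/index.ts; actual dispatch may differ.
import type { AssertionParams, GradingResult } from '../types';
import { handleVerdict } from './verdict';

const handlers: Record<string, (params: AssertionParams) => Promise<GradingResult>> = {
  // ...existing handlers...
  verdict: handleVerdict,
};
```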
### Configuration Schema
```yaml
assert:
  - type: verdict
    value:
      pipeline:
        - type: judge
          prompt: 'Evaluate politeness...'
          scale: [1, 5]
        - type: verify
          prompt: 'Check if explanation is valid...'
      repeat: 3
      aggregate: max
```
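A corresponding TypeScript shape for this schema might be the following; the field names mirror the YAML above, everything else is an assumption:

```typescript
// Sketch of config types mirroring the YAML schema above.
interface VerdictUnitConfig {
  type: 'judge' | 'verify' | 'categorical' | 'likert' | 'pairwise';
  prompt: string;
  scale?: [number, number]; // e.g. [1, 5]
  categories?: string[];    // e.g. ["yes", "no"]
  explanation?: boolean;
  name?: string;            // referenced by weighted aggregation
}

interface VerdictValueConfig {
  pipeline: VerdictUnitConfig[];
  repeat?: number;          // replicate the pipeline N times
  aggregate?: 'max' | 'mean' | 'weighted-mean';
}
```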
Additional capabilities to consider:

- Custom Extractors
- Rate Limiting
- Visualization
```text
src/
├── assertions/
│   └── verdict/
│       ├── index.ts       # Main assertion handler
│       ├── types.ts       # TypeScript types
│       ├── units/         # Unit implementations
│       │   ├── base.ts
│       │   ├── judge.ts
│       │   ├── verify.ts
│       │   └── aggregate.ts
│       ├── scales.ts      # Scale implementations
│       ├── pipeline.ts    # Pipeline executor
│       └── prompts.ts     # Default prompts
```
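Example configurations: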
```yaml
# Simple categorical judgment
assert:
  - type: verdict
    value:
      type: categorical
      prompt: "Is this response helpful?"
      categories: ["yes", "no"]
    threshold: 0.8
```
```yaml
# Likert scale rating
assert:
  - type: verdict
    value:
      type: likert
      prompt: "Rate the quality of this response"
      scale: [1, 5]
      explanation: true
    threshold: 4
```
```yaml
# Pairwise comparison
assert:
  - type: verdict
    value:
      type: pairwise
      prompt: "Which response is more accurate?"
      options: ["{{output}}", "{{vars.baseline}}"]
```
```yaml
# Hierarchical verification with majority voting
assert:
  - type: verdict
    value:
      pipeline:
        - layer:
            unit:
              type: categorical-judge
              prompt: "Does this response contain factual errors?"
              categories: ["yes", "no"]
              explanation: true
            repeat: 3
        - layer:
            unit:
              type: verify
              prompt: |
                Check if the explanation correctly identifies factual errors:
                Explanation: {{previous.explanation}}
                Response: {{output}}
        - type: max-pool
    threshold: 0.7
```
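With `repeat: 3` and `max-pool`, the majority answer across the three judge-verify chains becomes the verdict; the assertion then passes only if the aggregated score reaches the 0.7 threshold.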
```yaml
# Multi-criteria evaluation
assert:
  - type: verdict
    value:
      pipeline:
        - layer:
            units:
              - type: likert
                name: accuracy
                prompt: "Rate accuracy (1-5)"
                scale: [1, 5]
              - type: likert
                name: clarity
                prompt: "Rate clarity (1-5)"
                scale: [1, 5]
              - type: likert
                name: completeness
                prompt: "Rate completeness (1-5)"
                scale: [1, 5]
        - type: weighted-mean
          weights:
            accuracy: 0.5
            clarity: 0.3
            completeness: 0.2
    threshold: 4.0
```
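For example, scores of 5 for accuracy, 4 for clarity, and 3 for completeness aggregate to 5·0.5 + 4·0.3 + 3·0.2 = 4.3, which clears the 4.0 threshold.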
```yaml
# Debate-style evaluation
assert:
  - type: verdict
    value:
      pipeline:
        - layer:
            units:
              - type: debate-for
                prompt: "Argue why this response is good"
              - type: debate-against
                prompt: "Argue why this response is bad"
            repeat: 2
        - type: judge
          prompt: "Based on the debate, is this a good response?"
          categories: ["yes", "no"]
```
```yaml
# Using with providers
assert:
  - type: verdict
    provider: gpt-4o-mini # Override default provider
    value:
      type: categorical
      prompt: "Is this appropriate?"
      categories: ["yes", "no"]
```
```yaml
# Using with variables
assert:
  - type: verdict
    value:
      type: likert
      prompt: |
        Given the requirements: {{vars.requirements}}
        Rate how well this response meets them.
      scale: [1, 10]
    metric: requirement-satisfaction
```
```yaml
# Combining with other assertions
assert:
  - type: contains
    value: "{{vars.required_keyword}}"
  - type: verdict
    value:
      type: categorical
      prompt: "Is the tone professional?"
      categories: ["yes", "no"]
  - type: llm-rubric
    value: "Response should be concise"
```
This implementation is based on the Verdict library by Haize Labs. Key paper:
```bibtex
@article{kalra2025verdict,
  title={VERDICT: A Library for Scaling Judge-Time Compute},
  author={Kalra, Nimit and Tang, Leonard},
  journal={arXiv preprint arXiv:2502.18018},
  year={2025}
}
```
| Feature | Traditional (`llm-rubric`, `g-eval`) | Verdict |
| --- | --- | --- |
| Single LLM Call | ✅ Yes | ❌ No (multiple calls) |
| Verification | ❌ No | ✅ Yes (built-in) |
| Ensemble Support | ❌ No | ✅ Yes (native) |
| Structured Scales | ❌ No | ✅ Yes (typed) |
| Debate/Discussion | ❌ No | ✅ Yes |
| Token Efficiency | ✅ More efficient | ⚠️ Uses more tokens |
| Reliability | ⚠️ Variable | ✅ Higher |
1. Set up development branch:

   ```bash
   git checkout -b feature/verdict-integration
   ```

2. Create initial file structure:

   ```bash
   mkdir -p src/assertions/verdict/units
   touch src/assertions/verdict/{index.ts,types.ts,pipeline.ts}
   ```

3. Implement core types and base classes
4. Add basic unit tests
5. Create documentation and examples
Open questions:

- Should we support all Verdict features or start with a subset?
- How should visualization be handled in the Promptfoo UI?
- Should configuration be YAML-based, or should programmatic definition also be supported?
- How should this integrate with existing grading providers?