AgentJudge¶
A specialized agent for evaluating and judging outputs from other agents or systems. It acts as a quality control mechanism, providing objective assessments and feedback.
Based on the research paper: "Agent-as-a-Judge: Evaluate Agents with Agents" - arXiv:2410.10934
Overview¶
The AgentJudge is designed to evaluate and critique outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports both single-shot evaluations and iterative refinement through multiple evaluation loops with context building.
Key capabilities:

- Quality Assessment: Evaluates correctness, clarity, and completeness of agent outputs
- Structured Feedback: Provides detailed critiques with strengths, weaknesses, and suggestions
- Multimodal Support: Can evaluate text outputs alongside images
- Context Building: Maintains evaluation context across multiple iterations
- Batch Processing: Efficiently processes multiple evaluations
Architecture¶
graph TD
A[Input Task] --> B[AgentJudge]
B --> C{Evaluation Mode}
C -->|step()| D[Single Eval]
C -->|run()| E[Iterative Eval]
C -->|run_batched()| F[Batch Eval]
D --> G[Agent Core]
E --> G
F --> G
G --> H[LLM Model]
H --> I[Quality Analysis]
I --> J[Feedback & Output]
subgraph "Feedback Details"
N[Strengths]
O[Weaknesses]
P[Improvements]
Q[Accuracy Check]
end
J --> N
J --> O
J --> P
J --> Q
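The three evaluation modes in the diagram map directly to the judge's public methods. A minimal sketch of that mapping, using placeholder task strings:

from swarms import AgentJudge

judge = AgentJudge()

# step(): single-shot evaluation of one output (or a list of outputs)
single_eval = judge.step(task="Agent output: The capital of France is Paris.")

# run(): iterative evaluation; each loop builds on the previous critique
iterative_evals = judge.run(task="Agent output: The capital of France is Paris.")

# run_batched(): independent evaluation of several outputs
batched_evals = judge.run_batched(
    tasks=[
        "Agent output: 2 + 2 = 4",
        "Agent output: The derivative of sin(x) is cos(x)",
    ]
)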
Class Reference¶
Constructor¶
AgentJudge(
id: str = str(uuid.uuid4()),
agent_name: str = "Agent Judge",
description: str = "You're an expert AI agent judge...",
system_prompt: str = AGENT_JUDGE_PROMPT,
model_name: str = "openai/o1",
max_loops: int = 1,
verbose: bool = False,
*args,
**kwargs
)
Parameters¶
Parameter | Type | Default | Description |
---|---|---|---|
id | str | str(uuid.uuid4()) | Unique identifier for the judge instance |
agent_name | str | "Agent Judge" | Name of the agent judge |
description | str | "You're an expert AI agent judge..." | Description of the agent's role |
system_prompt | str | AGENT_JUDGE_PROMPT | System instructions for evaluation |
model_name | str | "openai/o1" | LLM model for evaluation |
max_loops | int | 1 | Maximum evaluation iterations |
verbose | bool | False | Enable verbose logging |
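Most of these parameters only need to be set when customizing the judge's persona or evaluation behavior. A minimal sketch of an explicit configuration; the description and system prompt text below are illustrative placeholders, not the library defaults:

from swarms import AgentJudge

judge = AgentJudge(
    agent_name="math-grader",
    description="Evaluates the correctness of mathematical answers.",  # illustrative
    system_prompt="You are a strict grader. Point out every factual or logical error.",  # illustrative, replaces AGENT_JUDGE_PROMPT
    model_name="openai/o1",
    max_loops=2,
    verbose=True,
)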
Methods¶
step()¶
Processes a single task or a list of tasks and returns an evaluation.
Parameter | Type | Default | Description |
---|---|---|---|
task | str | None | Single task/output to evaluate |
tasks | List[str] | None | List of tasks/outputs to evaluate |
img | str | None | Path to image for multimodal evaluation |

Returns: str - Detailed evaluation response
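The task, tasks, and img parameters cover the common call patterns; a short sketch (the image path is a placeholder):

from swarms import AgentJudge

judge = AgentJudge()

# Evaluate one output
judge.step(task="Agent output: The capital of France is Paris.")

# Evaluate several outputs in one call
judge.step(tasks=[
    "Agent output: 2 + 2 = 4",
    "Agent output: Water boils at 90 degrees Celsius at sea level.",
])

# Evaluate an output against an image
judge.step(task="Describe what you see in this image", img="path/to/image.jpg")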
run()¶
Executes evaluation in multiple iterations with context building.
Parameter | Type | Default | Description |
---|---|---|---|
task | str | None | Single task/output to evaluate |
tasks | List[str] | None | List of tasks/outputs to evaluate |
img | str | None | Path to image for multimodal evaluation |

Returns: List[str] - List of evaluation responses from each iteration
run_batched()¶
run_batched(
tasks: Optional[List[str]] = None,
imgs: Optional[List[str]] = None
) -> List[List[str]]
Executes batch evaluation of multiple tasks, optionally with corresponding images.
Parameter | Type | Default | Description |
---|---|---|---|
tasks | List[str] | None | List of tasks/outputs to evaluate |
imgs | List[str] | None | List of image paths (same length as tasks) |

Returns: List[List[str]] - Evaluation responses for each task
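Because imgs defaults to None, batches can also be evaluated without images. A minimal sketch of a text-only batch:

from swarms import AgentJudge

judge = AgentJudge()

# Each task is evaluated independently; reports[i] holds the responses for tasks[i]
reports = judge.run_batched(
    tasks=[
        "Summary produced by Agent A",
        "Summary produced by Agent B",
    ]
)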
Examples¶
Basic Usage¶
from swarms import AgentJudge
# Initialize with default settings
judge = AgentJudge()
# Single task evaluation
result = judge.step(task="The capital of France is Paris.")
print(result)
Custom Configuration¶
from swarms import AgentJudge
# Custom judge configuration
judge = AgentJudge(
agent_name="content-evaluator",
model_name="gpt-4",
max_loops=3,
verbose=True
)
# Evaluate multiple outputs
outputs = [
"Agent CalculusMaster: The integral of x^2 + 3x + 2 is (1/3)x^3 + (3/2)x^2 + 2x + C",
"Agent DerivativeDynamo: The derivative of sin(x) is cos(x)",
"Agent LimitWizard: The limit of sin(x)/x as x approaches 0 is 1"
]
evaluation = judge.step(tasks=outputs)
print(evaluation)
Iterative Evaluation with Context¶
from swarms import AgentJudge
# Multiple iterations with context building
judge = AgentJudge(max_loops=3)
# Each iteration builds on previous context
evaluations = judge.run(task="Agent output: 2+2=5")
for i, eval_result in enumerate(evaluations):
print(f"Iteration {i+1}: {eval_result}\n")
Multimodal Evaluation¶
from swarms import AgentJudge
judge = AgentJudge()
# Evaluate with image
evaluation = judge.step(
task="Describe what you see in this image",
img="path/to/image.jpg"
)
print(evaluation)
Batch Processing¶
from swarms import AgentJudge
judge = AgentJudge()
# Batch evaluation with images
tasks = [
"Describe this chart",
"What's the main trend?",
"Any anomalies?"
]
images = [
"chart1.png",
"chart2.png",
"chart3.png"
]
# Each task evaluated independently
evaluations = judge.run_batched(tasks=tasks, imgs=images)
for i, task_evals in enumerate(evaluations):
print(f"Task {i+1} evaluations: {task_evals}")
Reference¶
@misc{zhuge2024agentasajudgeevaluateagentsagents,
title={Agent-as-a-Judge: Evaluate Agents with Agents},
author={Mingchen Zhuge and Changsheng Zhao and Dylan Ashley and Wenyi Wang and Dmitrii Khizbullin and Yunyang Xiong and Zechun Liu and Ernie Chang and Raghuraman Krishnamoorthi and Yuandong Tian and Yangyang Shi and Vikas Chandra and Jürgen Schmidhuber},
year={2024},
eprint={2410.10934},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2410.10934}
}