AgentJudge

A specialized agent for evaluating and judging outputs from other agents or systems, acting as a quality-control mechanism that provides objective assessments and feedback.

Based on the research paper "Agent-as-a-Judge: Evaluate Agents with Agents" (arXiv:2410.10934).

Overview

The AgentJudge evaluates and critiques outputs from other AI agents, providing structured feedback on quality, accuracy, and areas for improvement. It supports both single-shot evaluation and iterative refinement through multiple evaluation loops with context building; a minimal usage sketch follows the capability list below.

Key capabilities:

  • Quality Assessment: Evaluates correctness, clarity, and completeness of agent outputs

  • Structured Feedback: Provides detailed critiques with strengths, weaknesses, and suggestions

  • Multimodal Support: Can evaluate text outputs alongside images

  • Context Building: Maintains evaluation context across multiple iterations

  • Batch Processing: Efficiently processes multiple evaluations
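As a concrete sketch of these capabilities working together, the judge can act as a quality gate in front of another system's output. The keyword check on the critique below is purely illustrative; the judge returns a free-form string, so a real pipeline would apply its own parsing or scoring logic.

from swarms import AgentJudge

judge = AgentJudge()

# Output produced by some upstream agent or system
candidate = "Agent Solver: The square root of 16 is 4."

# Ask the judge for a structured critique of the candidate output
critique = judge.step(task=candidate)

# Illustrative only: naive keyword check on the free-form critique
if "incorrect" in critique.lower():
    print("Rejected - requesting a revision")
else:
    print("Accepted")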

Architecture

graph TD
    A[Input Task] --> B[AgentJudge]
    B --> C{Evaluation Mode}

    C -->|step()| D[Single Eval]
    C -->|run()| E[Iterative Eval]
    C -->|run_batched()| F[Batch Eval]

    D --> G[Agent Core]
    E --> G
    F --> G

    G --> H[LLM Model]
    H --> I[Quality Analysis]
    I --> J[Feedback & Output]

    subgraph "Feedback Details"
        N[Strengths]
        O[Weaknesses]
        P[Improvements]
        Q[Accuracy Check]
    end

    J --> N
    J --> O
    J --> P
    J --> Q
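In code, the three evaluation modes in the diagram correspond directly to the three public methods documented below; a minimal sketch of the mapping:

from swarms import AgentJudge

judge = AgentJudge()

single = judge.step(task="...")                   # str: one evaluation
iterative = judge.run(task="...")                 # List[str]: one response per loop
batch = judge.run_batched(tasks=["...", "..."])   # List[List[str]]: responses per task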

Class Reference

Constructor

AgentJudge(
    id: str = str(uuid.uuid4()),
    agent_name: str = "Agent Judge",
    description: str = "You're an expert AI agent judge...",
    system_prompt: str = AGENT_JUDGE_PROMPT,
    model_name: str = "openai/o1",
    max_loops: int = 1,
    verbose: bool = False,
    *args,
    **kwargs
)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| id | str | str(uuid.uuid4()) | Unique identifier for the judge instance |
| agent_name | str | "Agent Judge" | Name of the agent judge |
| description | str | "You're an expert AI agent judge..." | Description of the agent's role |
| system_prompt | str | AGENT_JUDGE_PROMPT | System instructions for evaluation |
| model_name | str | "openai/o1" | LLM model for evaluation |
| max_loops | int | 1 | Maximum evaluation iterations |
| verbose | bool | False | Enable verbose logging |
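Because the system prompt is an ordinary string parameter, the evaluation rubric can be swapped at construction time. A minimal sketch with a hypothetical rubric (omit system_prompt to keep the default AGENT_JUDGE_PROMPT):

from swarms import AgentJudge

# Hypothetical rubric; any string can replace the default prompt
STRICT_RUBRIC = (
    "You are a strict reviewer. For each output, score correctness, "
    "clarity, and completeness from 1 to 5 and justify each score."
)

judge = AgentJudge(
    agent_name="rubric-judge",
    system_prompt=STRICT_RUBRIC,
    verbose=True,
)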

Methods

step()

step(
    task: str = None,
    tasks: Optional[List[str]] = None,
    img: Optional[str] = None
) -> str

Processes a single task or a list of tasks and returns the evaluation.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| task | str | None | Single task/output to evaluate |
| tasks | List[str] | None | List of tasks/outputs to evaluate |
| img | str | None | Path to image for multimodal evaluation |

Returns: str - Detailed evaluation response
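Note that step() returns a single string even when given a list of tasks, so multiple outputs are presumably critiqued together in one combined response; use run_batched() for per-task results. For example:

from swarms import AgentJudge

judge = AgentJudge()
critique = judge.step(tasks=["output A", "output B"])
print(type(critique))  # <class 'str'>: one combined evaluation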

run()

run(
    task: str = None,
    tasks: Optional[List[str]] = None,
    img: Optional[str] = None
) -> List[str]

Runs the evaluation for up to max_loops iterations, feeding each iteration's output into the next as context.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| task | str | None | Single task/output to evaluate |
| tasks | List[str] | None | List of tasks/outputs to evaluate |
| img | str | None | Path to image for multimodal evaluation |

Returns: List[str] - List of evaluation responses from each iteration
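Because run() returns one response per iteration, the last element reflects the most context-refined judgment. A minimal sketch:

from swarms import AgentJudge

judge = AgentJudge(max_loops=3)

evaluations = judge.run(task="Agent output: the Eiffel Tower is in Berlin.")
final_verdict = evaluations[-1]  # critique from the last, most refined iteration
print(final_verdict)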

run_batched()

run_batched(
    tasks: Optional[List[str]] = None,
    imgs: Optional[List[str]] = None
) -> List[List[str]]

Executes batch evaluation of multiple tasks, optionally with corresponding images.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| tasks | List[str] | None | List of tasks/outputs to evaluate |
| imgs | List[str] | None | List of image paths (same length as tasks) |

Returns: List[List[str]] - Evaluation responses for each task
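The outer list is indexed by task; each inner list holds that task's evaluation responses (presumably one per iteration, mirroring run()). For example:

from swarms import AgentJudge

judge = AgentJudge()
evaluations = judge.run_batched(tasks=["output A", "output B"])
print(len(evaluations))            # == number of tasks
first_task_evals = evaluations[0]  # all responses for tasks[0]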

Examples

Basic Usage

from swarms import AgentJudge

# Initialize with default settings
judge = AgentJudge()

# Single task evaluation
result = judge.step(task="The capital of France is Paris.")
print(result)

Custom Configuration

from swarms import AgentJudge

# Custom judge configuration
judge = AgentJudge(
    agent_name="content-evaluator",
    model_name="gpt-4",
    max_loops=3,
    verbose=True
)

# Evaluate multiple outputs
outputs = [
    "Agent CalculusMaster: The integral of x^2 + 3x + 2 is (1/3)x^3 + (3/2)x^2 + 2x + C",
    "Agent DerivativeDynamo: The derivative of sin(x) is cos(x)",
    "Agent LimitWizard: The limit of sin(x)/x as x approaches 0 is 1"
]

evaluation = judge.step(tasks=outputs)
print(evaluation)

Iterative Evaluation with Context

from swarms import AgentJudge

# Multiple iterations with context building
judge = AgentJudge(max_loops=3)

# Each iteration builds on previous context
evaluations = judge.run(task="Agent output: 2+2=5")
for i, eval_result in enumerate(evaluations):
    print(f"Iteration {i+1}: {eval_result}\n")

Multimodal Evaluation

from swarms import AgentJudge

judge = AgentJudge()

# Evaluate with image
evaluation = judge.step(
    task="Describe what you see in this image",
    img="path/to/image.jpg"
)
print(evaluation)

Batch Processing

from swarms import AgentJudge

judge = AgentJudge()

# Batch evaluation with images
tasks = [
    "Describe this chart",
    "What's the main trend?",
    "Any anomalies?"
]
images = [
    "chart1.png",
    "chart2.png", 
    "chart3.png"
]

# Each task evaluated independently
evaluations = judge.run_batched(tasks=tasks, imgs=images)
for i, task_evals in enumerate(evaluations):
    print(f"Task {i+1} evaluations: {task_evals}")

Reference

@misc{zhuge2024agentasajudgeevaluateagentsagents,
    title={Agent-as-a-Judge: Evaluate Agents with Agents}, 
    author={Mingchen Zhuge and Changsheng Zhao and Dylan Ashley and Wenyi Wang and Dmitrii Khizbullin and Yunyang Xiong and Zechun Liu and Ernie Chang and Raghuraman Krishnamoorthi and Yuandong Tian and Yangyang Shi and Vikas Chandra and Jürgen Schmidhuber},
    year={2024},
    eprint={2410.10934},
    archivePrefix={arXiv},
    primaryClass={cs.AI},
    url={https://arxiv.org/abs/2410.10934}
}