Agent Evaluation Methodology

Overview

Evaluating AI agents is fundamentally different from evaluating traditional software. Agents are non-deterministic, may take multiple valid paths to a solution, and produce outputs that span multiple quality dimensions simultaneously. This guide provides a structured methodology for measuring, comparing, and improving agent performance.

Core principle: Judge results, not exact paths. An agent that takes a different route to the correct answer is not wrong -- it may even be better.

When to Use

Testing a new agent or command before deployment
Comparing two prompt variants to determine which performs better
Regression testing after modifying an agent's system prompt or tools
Establishing baseline quality metrics for continuous monitoring
Debugging inconsistent agent behavior across different inputs

Core Evaluation Challenges

Agent evaluation must account for three fundamental difficulties:

1. Non-Determinism

The same agent with the same input may produce different outputs across runs. Temperature, sampling, and internal state contribute to variance.

Implication: Single-run evaluation is unreliable. Measure across multiple runs (at least 3, preferably 5) and report distributions, not single scores.
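As a rough illustration of reporting a distribution rather than a single score, the sketch below summarizes the final scores from repeated runs of one test case. The function name `summarize_runs` and the sample scores are invented for the example.

```python
import statistics


def summarize_runs(scores):
    """Summarize final scores from repeated runs of one agent on one input."""
    if len(scores) < 3:
        raise ValueError("need at least 3 runs to describe a distribution")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical final scores from five runs of the same test case.
summary = summarize_runs([4.2, 3.8, 4.0, 4.4, 3.6])
print(summary)
```

Reporting the spread alongside the mean makes it visible when two agents with similar averages differ in stability.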

2. Multiple Valid Paths

For most real-world tasks, there is more than one correct approach. A code review agent might organize findings by file or by severity -- both are valid.

Implication: Rubrics must evaluate outcome quality, not path similarity. Avoid comparing against a single "gold standard" output.

3. Composite Quality

Agent output quality is not a single number. An output can follow instructions perfectly but be poorly organized, or be well-written but miss half the requirements.

Implication: Use multi-dimensional rubrics that score each quality aspect independently.

Multi-Dimensional Evaluation Rubric

Score agent outputs across five weighted dimensions:

Dimension	Weight	What It Measures	Scale
Instruction Following	0.30	Did the agent do what was asked?	1-5
Output Completeness	0.25	Does the output cover all requirements?	1-5
Tool Efficiency	0.20	Did the agent use tools optimally (no redundant calls, no missing reads)?	1-5
Reasoning Quality	0.15	Is the agent's logic sound and its conclusions justified?	1-5
Response Coherence	0.10	Is the output well-structured, clear, and internally consistent?	1-5

Weighted Score Calculation

Final Score = (Instruction × 0.30) + (Completeness × 0.25) +
              (Efficiency × 0.20) + (Reasoning × 0.15) +
              (Coherence × 0.10)
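The formula can be sketched in Python as follows. The dictionary keys and the function name `weighted_score` are illustrative choices, not names defined by this guide; the weights are the ones from the rubric table.

```python
# Weights taken from the rubric table; key names are illustrative.
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}


def weighted_score(scores):
    """Combine per-dimension scores (1-5) into one weighted final score."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the five rubric dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return sum(scores[name] * weight for name, weight in WEIGHTS.items())


example = {
    "instruction_following": 4,
    "output_completeness": 5,
    "tool_efficiency": 3,
    "reasoning_quality": 4,
    "response_coherence": 4,
}
final = weighted_score(example)
print(round(final, 2))
```

For these example scores the sum is 1.20 + 1.25 + 0.60 + 0.60 + 0.40 = 4.05.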

Score interpretation:

Range	Quality Level	Action
4.5 - 5.0	Excellent	Production-ready; monitor for regression
3.5 - 4.4	Good	Acceptable; targeted improvements possible
2.5 - 3.4	Fair	Needs work; identify weakest dimensions
1.0 - 2.4	Poor	Significant redesign required
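A minimal sketch of mapping a final score to the quality levels in the table. The table lists its bands to one decimal place, so this sketch assumes that a score falling between two bands (for example 4.45) belongs to the lower band. The function name `quality_level` is illustrative.

```python
def quality_level(final_score):
    """Map a weighted final score (1.0-5.0) to a quality level from the table."""
    if not 1.0 <= final_score <= 5.0:
        raise ValueError("final score must be between 1.0 and 5.0")
    # Assumption: each band starts at its lower bound and runs up to the
    # next band's lower bound, so 4.45 counts as "Good", not "Excellent".
    if final_score >= 4.5:
        return "Excellent"
    if final_score >= 3.5:
        return "Good"
    if final_score >= 2.5:
        return "Fair"
    return "Poor"


print(quality_level(4.05))
```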

Dimension-Level Rubric Example

Instruction Following (Weight: 0.30):

Score	Description	Characteristics
5	Complete adherence	All instructions followed; nothing missed; no unnecessary additions
4	Minor deviations	1-2 instructions partially addressed; core task completed
3	Partial adherence	Major instructions followed; several secondary requirements missed
2	Significant gaps	Core task partially completed; multiple instructions ignored
1	Non-compliant	Output does not address the requested task

Evaluation Methodologies

LLM-as-Judge: Direct Scoring

Use a separate LLM instance to score agent outputs against the rubric. This is the most scalable approach for automated evaluation.

Setup:

You are an expert evaluator. Score the following agent output against
each rubric dimension on a 1-5 scale.

## Task Description
[Original task given to the agent]

## Agent Output
[Output to evaluate]

## Rubric
[Full rubric with dimension descriptions and scale definitions]

## Instructions
For each dimension:
1. Quote specific parts of the output that inform your score
2. Explain your reasoning in 1-2 sentences
3. Assign a score (1-5)
4. Calculate the weighted final score

Advantages: Scalable, consistent, can evaluate hundreds of outputs per hour.

Limitations: Judge LLM has its own biases; requires calibration.

LLM-as-Judge: Pairwise Comparison

Instead of absolute scoring, present two agent outputs and ask which is better. This is more reliable for detecting subtle quality differences.

Setup:

You are an expert evaluator. Compare these two agent outputs for the
same task and determine which is better.

## Task Description
[Original task]

## Output A
[First agent's output]

## Output B
[Second agent's output]

## Instructions
For each rubric dimension:
1. Compare both outputs
2. State which is better and why
3. Assign a preference: A >> B, A > B, A = B, B > A, B >> A

Finally, give an overall winner with reasoning.

Advantages: More reliable for close comparisons; eliminates absolute scale calibration issues.

Limitations: Only provides relative ranking; does not tell you if both outputs are bad.

Combined Approach

For thorough evaluation, use both methods:

Direct scoring to establish absolute quality baselines
Pairwise comparison to validate that improvements are real
Cross-validation by running the same evaluation with different judge prompts

Bias Mitigation

LLM judges carry systematic biases that can distort evaluation results. Identify and mitigate each one.

Position Bias

Problem: The judge favors whichever output appears first (or last) in the comparison prompt.

Mitigation:

Run each pairwise comparison twice with swapped positions (A|B and B|A)
Only count a preference if the judge is consistent across both orderings
Discard results where the judge flips preference based on position
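The swap-and-check procedure above could be sketched like this. The judge is passed in as a plain callable with a hypothetical signature (task, first output, second output) returning "first", "second", or "tie"; in practice it would wrap a call to whatever judge model is in use. The two stub judges exist only to demonstrate the behavior.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (task, first_output, second_output) -> verdict,
# where verdict is "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def consistent_preference(judge: Judge, task: str,
                          output_a: str, output_b: str) -> Optional[str]:
    """Return "A", "B", or "tie" if the judge agrees with itself across
    both orderings; return None if the preference flips with position."""
    forward = judge(task, output_a, output_b)   # A shown first
    backward = judge(task, output_b, output_a)  # B shown first
    verdict_forward = {"first": "A", "second": "B", "tie": "tie"}[forward]
    verdict_backward = {"first": "B", "second": "A", "tie": "tie"}[backward]
    return verdict_forward if verdict_forward == verdict_backward else None


def always_first(task, first, second):
    """Stub judge with pure position bias: always prefers the first output."""
    return "first"


def prefers_longer(task, first, second):
    """Stub judge whose verdict depends on content (length), not position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(consistent_preference(always_first, "task", "long output", "short"))
print(consistent_preference(prefers_longer, "task", "long output", "short"))
```

The position-biased stub yields `None` (discarded), while the content-based stub yields a stable preference for A.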

Length Bias (Verbosity Bias)

Problem: The judge favors longer, more verbose outputs regardless of information density.

Mitigation:

Add explicit instruction: "A concise output that covers all requirements is preferable to a verbose output with padding"
Include examples where the shorter output is scored higher
Normalize scores by information density, not raw token count

Self-Enhancement Bias

Problem: When the judge model is the same as (or similar to) the agent model, it may favor outputs that match its own style.

Mitigation:

Use a different model as judge when possible
Include diverse example outputs in the judge prompt to calibrate expectations
Cross-validate with human evaluation on a sample

Authority Bias

Problem: The judge defers to outputs that cite sources, use confident language, or invoke authority, even when the content is incorrect.

Mitigation:

Instruct the judge to verify factual claims, not just accept them
Include test cases where a confident but wrong output should score lower than a hedged but correct one
Separate "confidence" from "correctness" in the rubric

Test Set Design

The quality of your evaluation is bounded by the quality of your test set.

Complexity Stratification

Design test cases across three complexity levels:

Level	Characteristics	Purpose	Proportion
Simple	Single-step, unambiguous input, one correct answer	Baseline functionality verification	30%
Moderate	Multi-step, some ambiguity, 2-3 valid approaches	Typical use case coverage	50%
Complex	Multi-file, ambiguous requirements, edge cases, requires reasoning	Stress testing and ceiling measurement	20%
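To turn the proportions into concrete case counts for a test set of a given size, one option is a largest-remainder allocation, sketched below. The function name `allocate_cases` and the tie-breaking order are assumptions made for the example.

```python
# Proportions from the stratification table, as whole percentages.
PERCENT = {"simple": 30, "moderate": 50, "complex": 20}


def allocate_cases(total):
    """Split `total` test cases across complexity levels (30/50/20)."""
    if total < 0:
        raise ValueError("total must be non-negative")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {level: total * pct // 100 for level, pct in PERCENT.items()}
    remainders = {level: total * pct % 100 for level, pct in PERCENT.items()}
    leftover = total - sum(counts.values())
    # Hand leftover cases to the levels with the largest remainders;
    # ties go to the level listed first in PERCENT.
    ranked = sorted(remainders, key=remainders.get, reverse=True)
    for level in ranked[:leftover]:
        counts[level] += 1
    return counts


print(allocate_cases(20))
print(allocate_cases(15))
```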

Representative Sampling

Ensure your test set covers:

Input variety: Different phrasing, different levels of detail in user requests
Domain spread: If the agent works across multiple domains, test all of them
Edge cases: Empty input, malformed input, conflicting requirements, very large inputs
Failure modes: Inputs designed to trigger known weaknesses

Test Case Template

## Test Case: [ID]-[Brief Description]

**Complexity:** Simple | Moderate | Complex
**Category:** [Domain or feature being tested]

**Input:**
[Exact input to the agent]

**Context:**
[Any additional context the agent receives: files, history, etc.]

**Expected Behavior:**
- [Requirement 1]
- [Requirement 2]
- [Requirement 3]

**Evaluation Notes:**
[What to look for in the output; common failure modes for this case]
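If test cases are stored programmatically rather than as documents, the template might map onto a small data class like the one below. The class name `AgentTestCase` and its field names are illustrative and simply mirror the template's sections.

```python
from dataclasses import dataclass, field
from typing import List

COMPLEXITY_LEVELS = ("Simple", "Moderate", "Complex")


@dataclass
class AgentTestCase:
    """One evaluation test case, mirroring the template's sections."""
    case_id: str
    description: str
    complexity: str
    category: str
    input_text: str
    context: str = ""
    expected_behavior: List[str] = field(default_factory=list)
    evaluation_notes: str = ""

    def __post_init__(self):
        # Reject complexity labels outside the three defined levels.
        if self.complexity not in COMPLEXITY_LEVELS:
            raise ValueError(f"complexity must be one of {COMPLEXITY_LEVELS}")

    @property
    def title(self):
        """Heading in the template's "[ID]-[Brief Description]" form."""
        return f"{self.case_id}-{self.description}"


# Hypothetical example case.
case = AgentTestCase(
    case_id="TC001",
    description="Empty diff",
    complexity="Simple",
    category="Code review",
    input_text="Review the attached diff.",
    expected_behavior=["States that the diff is empty", "Does not invent findings"],
)
print(case.title)
```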

Rubric Generation

For a new agent, generate a tailored rubric before running any evaluations.

Step-by-Step Rubric Creation

1. Define the task domain. What does this agent do? What does a good output look like?
2. Identify quality dimensions. Start with the five standard dimensions, then add domain-specific ones if needed (e.g., "Security Correctness" for a security review agent).
3. Write level descriptions. For each dimension, define what a score of 1, 3, and 5 looks like. Fill in 2 and 4 by interpolation.
4. Add characteristics. For each level, list 2-3 observable characteristics that a judge can check for.
5. Include examples. Provide at least one example output for a score of 5 and one for a score of 2. This anchors the judge's calibration.
6. Document edge cases. Describe scenarios where scoring is ambiguous and provide guidance.

Example Domain-Specific Dimension

Security Correctness (for a security review agent):

Score	Description	Characteristics
5	All vulnerabilities identified with correct severity and remediation	Zero false negatives; remediations are production-ready
4	Most vulnerabilities found; 1 minor miss	May miss a low-severity issue; remediations are correct
3	Major vulnerabilities found; some gaps	Misses 1-2 medium-severity issues; some remediations are vague
2	Partial coverage with false positives	Identifies some issues but also flags non-issues; confuses severity
1	Fails to identify critical vulnerabilities	Misses critical issues; output is unreliable for security decisions

Evaluation Workflows

Workflow 1: Testing a New Agent

1. Create test set (10-20 cases across complexity levels)
2. Run agent on all test cases (3 runs each)
3. Score all outputs using direct scoring
4. Compute per-dimension and overall averages
5. Identify dimensions scoring below 3.5
6. Refine agent prompt targeting weak dimensions
7. Re-run and compare using pairwise evaluation
8. Iterate until all dimensions reach target threshold
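Steps 4 and 5 of this workflow might look like the following sketch. It assumes each scored output is a dictionary mapping dimension names to 1-5 scores; the function names and the sample data are invented for illustration.

```python
from statistics import mean


def dimension_averages(scored_outputs):
    """Average each dimension over all scored outputs (all cases, all runs)."""
    if not scored_outputs:
        raise ValueError("no scored outputs to average")
    dimensions = scored_outputs[0].keys()
    return {
        dim: mean([output[dim] for output in scored_outputs])
        for dim in dimensions
    }


def weak_dimensions(averages, threshold=3.5):
    """List dimensions whose average falls below the target threshold."""
    return sorted(dim for dim, avg in averages.items() if avg < threshold)


# Hypothetical scores for three outputs on two of the rubric dimensions.
scored = [
    {"instruction_following": 4, "tool_efficiency": 3},
    {"instruction_following": 5, "tool_efficiency": 3},
    {"instruction_following": 4, "tool_efficiency": 4},
]
averages = dimension_averages(scored)
weak = weak_dimensions(averages)
print(averages, weak)
```

Here only `tool_efficiency` averages below 3.5, so it is the dimension to target in the next prompt revision.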

Workflow 2: Comparing Prompt Variants

1. Select 10 representative test cases
2. Run variant A on all cases (3 runs each)
3. Run variant B on all cases (3 runs each)
4. Run pairwise comparison (with position swapping) on all pairs
5. Calculate win rate for each variant
6. Check that the difference is meaningful (min 60% consistent preference after position swapping)
7. Analyze dimension-level differences for insights
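A win rate on a small test set can exceed 60% by chance. As one possible extra check (an assumption here; this guide does not prescribe a specific test), the sketch below computes the win rate over position-consistent verdicts and an exact two-sided sign test against a 50/50 null. Verdicts are assumed to be "A", "B", "tie", or `None` for discarded comparisons.

```python
from math import comb


def win_rate(verdicts, variant="B"):
    """Share of decided comparisons (ties and discards excluded) won by `variant`."""
    decided = [v for v in verdicts if v in ("A", "B")]
    if not decided:
        raise ValueError("no decided comparisons")
    return decided.count(variant) / len(decided)


def sign_test_p_value(wins, losses):
    """Exact two-sided binomial p-value under the null of no preference."""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcome: variant B wins 8 of 10 decided comparisons.
verdicts = ["B"] * 8 + ["A"] * 2 + [None, "tie"]
rate = win_rate(verdicts, "B")
p_value = sign_test_p_value(8, 2)
print(rate, p_value)
```

In this example the win rate is 80%, yet the p-value is about 0.11, which suggests that 10 cases may be too few to rule out chance.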

Workflow 3: Regression Testing

1. Maintain a fixed test set of 15-20 cases (do not modify between runs)
2. Run baseline agent and store scores
3. After any agent modification, run the same test set
4. Compare per-dimension scores against baseline
5. Flag any dimension that drops more than 0.3 points
6. Investigate and fix regressions before deployment
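Steps 4 and 5 could be sketched as a comparison of per-dimension averages against the stored baseline. The function name `find_regressions` and the sample scores are illustrative; the small tolerance is there so that a drop of exactly 0.3 is not flagged because of floating-point rounding.

```python
def find_regressions(baseline, current, max_drop=0.3):
    """Return dimensions whose score dropped by more than `max_drop`,
    mapped to the size of the drop (rounded to two decimals)."""
    tolerance = 1e-9  # guards against float noise at exactly max_drop
    regressions = {}
    for dim, base_score in baseline.items():
        drop = base_score - current[dim]
        if drop > max_drop + tolerance:
            regressions[dim] = round(drop, 2)
    return regressions


# Hypothetical per-dimension averages before and after a prompt change.
baseline = {"instruction_following": 4.4, "output_completeness": 4.1, "tool_efficiency": 3.9}
current = {"instruction_following": 4.3, "output_completeness": 3.6, "tool_efficiency": 3.9}
flagged = find_regressions(baseline, current)
print(flagged)
```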

Workflow 4: Continuous Quality Monitoring

1. Sample 5% of production agent invocations
2. Score sampled outputs using automated direct scoring
3. Track weekly averages per dimension
4. Alert if any dimension drops below threshold
5. Monthly: review trend data and identify systematic patterns
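For step 1, one way to pick roughly 5% of invocations is to hash a stable identifier, so the same invocation is always either in or out of the sample. This is a sketch under the assumption that each invocation has a string ID; `invocation_id` and `is_sampled` are hypothetical names.

```python
import hashlib


def is_sampled(invocation_id, rate_percent=5):
    """Deterministically select about `rate_percent`% of invocation IDs."""
    digest = hashlib.sha256(invocation_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


# Hypothetical IDs; roughly 5% of them should be selected.
selected = [f"inv-{i}" for i in range(10000) if is_sampled(f"inv-{i}")]
print(len(selected))
```

Hash-based selection needs no shared state between workers, unlike a random draw whose results must be recorded to be reproducible.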

Metrics Reference

Metric	Use Case	Interpretation
Precision	How many of the agent's claims/findings are correct	High precision = few false positives
Recall	How many of the actual issues did the agent find	High recall = few false negatives
F1 Score	Harmonic mean of precision and recall	Balanced measure; use when both matter equally
Cohen's kappa (κ)	Agreement between two judges beyond chance	> 0.6 = substantial agreement; > 0.8 = strong
Spearman's rho (ρ)	Rank correlation between judge scores and human scores	> 0.7 = strong correlation; validates judge reliability
Win Rate	Percentage of pairwise comparisons won	> 60% with position swapping = meaningful improvement
Score Variance	Standard deviation across multiple runs	High variance = agent behavior is unstable
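Two of the metrics above can be computed with the standard library alone, as sketched below. The function names and sample data are illustrative. The kappa shown is the unweighted form, which treats all disagreements equally; for ordinal 1-5 scores a weighted variant may be more appropriate.

```python
from collections import Counter


def precision_recall_f1(found, actual):
    """Precision, recall, and F1 for a set of agent findings vs. ground truth."""
    found, actual = set(found), set(actual)
    true_positives = len(found & actual)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cohens_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two judges rating the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from each judge's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical findings from a security review agent vs. known issues.
p, r, f1 = precision_recall_f1(
    found={"sqli", "xss", "csrf", "weak-hash"},
    actual={"sqli", "xss", "ssrf"},
)
# Hypothetical 1-5 scores from two judges on six outputs.
kappa = cohens_kappa([5, 4, 4, 3, 2, 5], [5, 4, 3, 3, 2, 4])
print(p, r, f1, kappa)
```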

Anti-Patterns

Anti-Pattern	Problem	Fix
Single-run evaluation	Non-determinism makes single runs unreliable	Run 3-5 times; report distributions
Gold-standard comparison	Penalizes valid alternative approaches	Use rubric-based evaluation, not exact-match
Same model as judge	Self-enhancement bias inflates scores	Use a different model or calibrate with human samples
No position swapping	Position bias corrupts pairwise results	Always run both orderings (A|B and B|A)
Testing only happy path	Misses failure modes that matter in production	Include edge cases and adversarial inputs in test set
Ignoring dimension breakdown	Overall score hides specific weaknesses	Always report per-dimension scores alongside aggregate
Static test set	Test cases become stale as the agent evolves	Refresh 20% of test cases quarterly with new patterns
Evaluating process, not outcome	Penalizes agents for valid alternative paths	Judge the final output quality, not the intermediate steps

Related Skills

Agent System Prompt Design

Agent Context Engineering

Dispatching Parallel Agents