AI Agent Evaluation — Testing and Scoring AI Agents in Production

Posted on: 5/6/2026 10:10:15 AM

Why Evaluating AI Agents Is Harder Than Evaluating Standalone LLMs

A standalone LLM receives a prompt and returns a response, which you can grade with BLEU, ROUGE, or human review. AI Agents are different: they reason across multiple steps, call tools, receive results, and continue reasoning — a long chain of actions in which a single mistake can cascade into failure of the entire run.

  • 37%: gap between benchmark and production performance
  • 74%: production agents that still need human-in-the-loop evaluation
  • 80%: agreement between LLM-as-Judge and human raters
  • 500–5000x: cost savings compared to human review

The Core Challenge

AI Agents are non-deterministic — the same input can produce 10 different execution paths across 10 runs, all potentially valid. Evaluation must assess both trajectory (the path taken) and outcome (the final result), not just one or the other.

Evaluation Architecture: The Three-Layer Framework

graph TD
    A[AI Agent System] --> B[Reasoning Layer]
    A --> C[Action Layer]
    A --> D[Overall Execution]
    B --> B1[Plan Quality]
    B --> B2[Plan Adherence]
    B --> B3[Task Decomposition]
    C --> C1[Tool Selection]
    C --> C2[Argument Correctness]
    C --> C3[Error Handling]
    D --> D1[Task Completion]
    D --> D2[Step Efficiency]
    D --> D3[Response Quality]
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style C fill:#2c3e50,stroke:#fff,color:#fff
    style D fill:#2c3e50,stroke:#fff,color:#fff
    style B1 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B2 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B3 fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style C1 fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style C2 fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style C3 fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style D1 fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style D2 fill:#f8f9fa,stroke:#ff9800,color:#2c3e50
    style D3 fill:#f8f9fa,stroke:#ff9800,color:#2c3e50

Three evaluation layers for AI Agents: Reasoning, Action, and Overall Execution

Layer 1 — Reasoning Layer

Evaluates planning quality and task decomposition. Does the agent produce a logical plan? Does it adhere to that plan during execution?

  • PlanQualityMetric: Is the plan complete, logical, and feasible?
  • PlanAdherenceMetric: Does the agent deviate from its initial plan?
  • TaskDecomposition: Is the complex task broken into appropriate sub-tasks?
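
To make plan adherence concrete, here is a minimal sketch (illustrative only, not a framework API) that scores what fraction of the agent's declared plan steps appear, in order, in the executed trace; the step names and exact-match comparison are simplifying assumptions.

from typing import List

def plan_adherence(planned_steps: List[str], executed_steps: List[str]) -> float:
    """Fraction of planned steps that appear in the trace in the planned order."""
    matched, cursor = 0, 0
    for step in planned_steps:
        # Scan forward only, so out-of-order execution is not counted as adherence.
        for i in range(cursor, len(executed_steps)):
            if executed_steps[i] == step:
                matched, cursor = matched + 1, i + 1
                break
    return matched / len(planned_steps) if planned_steps else 1.0

# Extra steps are ignored; only missing or reordered plan steps reduce the score.
print(plan_adherence(
    ["search_docs", "draft_answer", "verify_sources"],
    ["search_docs", "call_calculator", "draft_answer", "verify_sources"],
))  # 1.0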

Layer 2 — Action Layer

Evaluates tool calling — did the agent select the right tool, pass correct arguments, and handle errors appropriately?

  • ToolCorrectnessMetric: Was the appropriate tool selected for the context?
  • ArgumentCorrectnessMetric: Are arguments valid, complete, and correctly typed?
  • ErrorRecovery: When a tool fails, does the agent retry/fallback appropriately?
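
As a concrete illustration of tool-selection scoring (a sketch, not DeepEval's ToolCorrectnessMetric itself), the check below compares the tools the agent actually called against the tools a reference solution expects; the tool names are hypothetical.

from typing import List

def tool_selection_score(expected_tools: List[str], called_tools: List[str]) -> float:
    """Fraction of expected tools that the agent actually called (order-insensitive)."""
    if not expected_tools:
        return 1.0
    called = set(called_tools)
    return sum(1 for tool in expected_tools if tool in called) / len(expected_tools)

print(tool_selection_score(["sql_query", "plot_chart"], ["web_search", "sql_query"]))  # 0.5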

Layer 3 — Overall Execution

Evaluates final results and overall efficiency.

  • TaskCompletionMetric: Was the task completed according to requirements?
  • StepEfficiencyMetric: Are there redundant steps or meaningless loops?
  • ResponseQuality: Is the final output accurate, complete, and useful?
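
A minimal sketch of a step-efficiency check (illustrative, using a hypothetical trace format of (tool, arguments) pairs): exact-duplicate tool calls are a cheap signal of loops or redundant work.

from collections import Counter
from typing import List, Tuple

def redundant_calls(trace: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (tool, arguments) pairs that were executed more than once."""
    counts = Counter(trace)
    return [call for call, n in counts.items() if n > 1]

trace = [
    ("web_search", "q=acme revenue 2025"),
    ("web_search", "q=acme revenue 2025"),  # repeated call, likely a loop
    ("summarize", "doc=3"),
]
print(redundant_calls(trace))  # [('web_search', 'q=acme revenue 2025')]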

Trajectory Metrics vs Outcome Metrics

These two evaluation approaches complement each other:

Trajectory metrics
  • What they measure: the complete execution path — every reasoning step, tool call, and decision
  • Strength: they reveal why an agent failed
  • Weakness: they may reject creative but valid paths
  • When to use: debugging, development, optimizing agent behavior
  • Example: the agent chose the search tool before trying a SQL query (wrong order)

Outcome metrics
  • What they measure: the final result — did the task complete correctly?
  • Strength: simple, and they directly measure business value
  • Weakness: no insight into the cause of a failure
  • When to use: production monitoring, regression testing
  • Example: the agent returned correct results on 95% of 1,000 test cases

Best Practice

Use outcome metrics as the primary signal in production (pass/fail). When outcome metrics drop, use trajectory metrics to debug root cause. Never rely on just one — combine both for a comprehensive picture.

LLM-as-Judge: When AI Scores AI

This method uses a powerful LLM (typically Claude or GPT-4) as a "judge" to evaluate agent output. It achieves ~80% agreement with human raters at 500–5000x lower cost.

graph LR
    A[Agent Output + Context] --> B[Judge LLM]
    C[Evaluation Rubric] --> B
    D[Few-shot Examples] --> B
    B --> E[Structured Score + Reasoning]
    E --> F{Pass Threshold?}
    F -->|Yes| G[Deploy/Continue]
    F -->|No| H[Flag for Review]
    style A fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style B fill:#e94560,stroke:#fff,color:#fff
    style C fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style D fill:#f8f9fa,stroke:#4CAF50,color:#2c3e50
    style E fill:#2c3e50,stroke:#fff,color:#fff
    style G fill:#4CAF50,stroke:#fff,color:#fff
    style H fill:#ff9800,stroke:#fff,color:#fff

LLM-as-Judge pipeline: Agent output scored by Judge LLM based on rubric and examples

Designing Effective Rubrics

The rubric is the single most important factor for LLM-Judge quality. A good rubric needs:

  • Specific: Convert every criterion into measurable yes/no questions
  • Evidence-based: Require the judge to cite evidence from the output
  • Hierarchical: Organize in tiers (7 dimensions → 25 sub-dimensions → 130 items)
  • Domain-specific: Rubrics for coding agents differ completely from research agents

For example, a minimal weighted rubric:

{
  "rubric": {
    "task_completion": {
      "question": "Did the agent complete the requested task?",
      "weight": 0.4,
      "criteria": [
        "All required outputs are present",
        "Outputs match expected format",
        "No critical information missing"
      ]
    },
    "tool_usage": {
      "question": "Were tools used appropriately?",
      "weight": 0.3,
      "criteria": [
        "Correct tool selected for each sub-task",
        "No redundant tool calls",
        "Error conditions handled gracefully"
      ]
    },
    "reasoning_quality": {
      "question": "Is the reasoning chain logical and efficient?",
      "weight": 0.3,
      "criteria": [
        "Clear task decomposition",
        "No circular reasoning",
        "Appropriate use of context"
      ]
    }
  }
}
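
Assuming the judge returns a score between 0 and 1 for each dimension of the rubric above, a weighted overall score can be aggregated as in this sketch (dimension names and weights mirror the JSON; the judging call itself is omitted).

RUBRIC_WEIGHTS = {"task_completion": 0.4, "tool_usage": 0.3, "reasoning_quality": 0.3}

def overall_score(judge_scores: dict) -> float:
    """Weighted sum of per-dimension judge scores (each expected in [0, 1])."""
    return sum(RUBRIC_WEIGHTS[dim] * judge_scores[dim] for dim in RUBRIC_WEIGHTS)

print(overall_score({"task_completion": 1.0, "tool_usage": 0.67, "reasoning_quality": 0.8}))  # ≈0.84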

Mitigating LLM-Judge Bias

Research shows error rates can exceed 50% without proper bias handling. Three common biases:

  • Position Bias: the judge favors the response appearing first in A/B comparisons. Mitigation: randomize presentation order.
  • Length Bias: longer responses score higher even when the content isn't better. Mitigation: instruct the judge that brevity is preferred when correct.
  • Agreeableness Bias: the judge tends to agree with responses rather than critique them. Mitigation: run an ensemble of N judge instances and take the majority vote.
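
Two of these mitigations can be combined in a few lines, as sketched below; `judge` is a placeholder for any function that calls the judge LLM on a pair of outputs and returns "A" or "B".

import random
from collections import Counter

def debiased_compare(judge, output_a: str, output_b: str, n_judges: int = 5) -> str:
    """Randomize presentation order (position bias) and majority-vote an ensemble."""
    votes = []
    for _ in range(n_judges):
        if random.random() < 0.5:
            votes.append(judge(output_a, output_b))
        else:
            # Presented in swapped order, so map the verdict back.
            votes.append("B" if judge(output_b, output_a) == "A" else "A")
    return Counter(votes).most_common(1)[0][0]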

Critical Warning

Before deploying LLM-as-Judge to production, validate it by measuring Spearman correlation against 100–200 human-scored samples. Target a correlation of at least 0.80 before trusting automated judgment; below this threshold, the judge is making unreliable decisions.
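
The calibration step itself is a one-liner with scipy, as in this sketch (the score lists are placeholders; in practice they would hold 100 to 200 paired ratings).

from scipy.stats import spearmanr

human_scores = [0.9, 0.4, 0.7, 0.8, 0.2]
judge_scores = [0.85, 0.5, 0.75, 0.7, 0.3]

rho, _ = spearmanr(human_scores, judge_scores)
print(f"Spearman rho = {rho:.2f}")  # trust the judge only if rho >= 0.80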

Pass@k and Pass^k: Two Faces of Reliability

Agent evaluation introduces two metrics not typically needed for standard LLM evaluation:

Pass@k — Probability of success at least once in k attempts

Suitable for use cases where users can retry: chatbots, code generation, search. If pass@3 = 95%, the agent succeeds within 3 attempts for 95% of cases.

Pass^k — Probability of success on ALL k attempts

Suitable for critical use cases: financial transactions, deployment automation, medical decisions. If pass^5 = 90%, the agent succeeds all 5/5 times for 90% of cases — measuring true reliability.

An agent with pass@1 = 85% sounds acceptable, but pass^5 drops to only ~44% — meaning nearly half of all tasks will fail at least once across 5 runs. This is a critical insight for production systems.
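
Under the simplifying assumption that each attempt succeeds independently with probability p, both metrics follow directly, as in this sketch (which reproduces the ~44% figure from the pass@1 = 85% example).

def pass_at_k(p: float, k: int) -> float:
    """Probability of succeeding at least once in k attempts."""
    return 1 - (1 - p) ** k

def pass_hat_k(p: float, k: int) -> float:
    """Probability of succeeding on all k attempts (pass^k)."""
    return p ** k

print(round(pass_at_k(0.85, 3), 3))   # 0.997, fine when users can simply retry
print(round(pass_hat_k(0.85, 5), 3))  # 0.444, a very different reliability picture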

Commonly used agent benchmarks:

  • SWE-bench Verified (coding): real bug fixes from GitHub issues, verified by test suites. Best for coding agents and PR automation.
  • GAIA (general reasoning): multi-step questions requiring multiple tools. Best for general-purpose agents.
  • WebArena (web automation): navigation, form filling, and transactions on the web. Best for browser agents and RPA.
  • AgentBench (multi-domain): 8 different environments; measures robustness. Best for cross-domain agents.
  • Humanity's Last Exam (expert knowledge): extremely difficult questions from domain experts. Best for probing frontier model capabilities.
  • ARC-AGI-3 (abstraction): pattern recognition and novel reasoning. Best for testing reasoning capabilities.

Recommendation

Use 2–4 complementary benchmarks rather than relying on just one. Enterprise agents should combine: 1 domain-specific benchmark + 1 general reasoning + custom evals from real production cases.

Integrating Evaluation into CI/CD Pipelines

graph TD
    A[Code Change / Model Update] --> B{Trigger Type}
    B -->|Commit-based| C[Run Unit Evals]
    B -->|Schedule-based| D[Run Full Benchmark Suite]
    B -->|Event-driven| E[Run Diagnostic Eval]
    C --> F[Lightweight Checks: 100% traffic]
    D --> G[LLM-Judge: 5-10% sample]
    E --> H[Deep Analysis: Flagged cases]
    F --> I{Pass Gate?}
    G --> I
    H --> I
    I -->|Dev: 70%| J[Merge to Staging]
    I -->|Staging: 85%| K[Canary Deploy]
    I -->|Production: 95%| L[Full Rollout]
    style A fill:#e94560,stroke:#fff,color:#fff
    style B fill:#2c3e50,stroke:#fff,color:#fff
    style J fill:#ff9800,stroke:#fff,color:#fff
    style K fill:#4CAF50,stroke:#fff,color:#fff
    style L fill:#4CAF50,stroke:#fff,color:#fff

CI/CD pipeline with evaluation integration and progressive deployment gates

Progressive Deployment Gates

Set increasing performance thresholds per environment:

  • Development (70%): Allows experimentation and fast iteration
  • Staging (85%): Must approach production quality
  • Production (95%): Only deploy when exceeding the highest threshold
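
In a CI job these gates reduce to a single threshold check per environment, as in this sketch (the thresholds mirror the list above; wiring into your pipeline is left out).

GATES = {"development": 0.70, "staging": 0.85, "production": 0.95}

def passes_gate(environment: str, eval_score: float) -> bool:
    """Return True if the evaluation score clears this environment's threshold."""
    return eval_score >= GATES[environment]

assert passes_gate("staging", 0.88)
assert not passes_gate("production", 0.93)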

Continuous Evaluation Strategy

from deepeval import evaluate
from deepeval.metrics import (
    TaskCompletionMetric,
    ToolCorrectnessMetric,
    GEval
)
from deepeval.test_case import LLMTestCaseParams

# Define a custom rubric-based metric
coherence_metric = GEval(
    name="Agent Coherence",
    criteria="""Evaluate whether the agent's reasoning chain is:
    1. Logically connected step-to-step
    2. Free of contradictions
    3. Efficient (no unnecessary loops)""",
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT
    ],
    threshold=0.7
)

# Production evaluation on sampled traffic.
# `scheduled`, `sample_production_traces`, `alert_team`, and `trigger_deep_eval`
# are placeholders for your own scheduler and observability tooling;
# `sample_production_traces` must return LLMTestCase objects built from traces.
@scheduled(cron="0 */6 * * *")  # Every 6 hours
def run_production_eval():
    recent_traces = sample_production_traces(n=50)
    results = evaluate(
        test_cases=recent_traces,
        metrics=[
            TaskCompletionMetric(threshold=0.9),
            ToolCorrectnessMetric(threshold=0.85),
            coherence_metric
        ]
    )
    if results.overall_score < 0.85:  # aggregate score assumed to come from your own wrapper
        alert_team(results)
        trigger_deep_eval(recent_traces)

Frameworks and Evaluation Tools

  • DeepEval: open-source, 50+ built-in metrics, tracing with the @observe decorator. Best for teams wanting self-hosted, custom metrics.
  • Braintrust: managed platform, real-time scoring, dataset management. Best for teams needing quick production monitoring.
  • Galileo: rubric-based evaluation, agent-specific metrics, guardrails. Best for enterprises needing compliance plus observability.
  • MLflow: MLOps pipeline integration, experiment tracking, model registry. Best for teams already using MLflow for ML workflows.
  • Arize Phoenix: tracing plus evaluation, LLM observability, drift detection. Best for teams needing a full observability stack.

Evaluation by Agent Type

Coding Agents

Use deterministic test suites (unit test pass/fail) combined with transcript analysis for code quality. SWE-bench Verified is the gold standard — fixing real bugs from OSS repos with ground truth tests.
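
Deterministic grading for a coding agent can be as simple as running the repository's test suite against the agent's patch, as in this sketch (the workspace path and pytest command are assumptions about the project under test).

import subprocess

def grade_patched_workspace(workspace: str) -> bool:
    """Return True if the test suite passes in the workspace containing the agent's patch."""
    result = subprocess.run(["pytest", "-q"], cwd=workspace, capture_output=True, text=True)
    return result.returncode == 0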

Conversational Agents

Combine state verification (does the agent remember context correctly?) + LLM rubrics for tone, empathy, and helpfulness. Use simulated user personas to generate diverse test traffic.

Research Agents

Evaluate three factors: groundedness (are sources cited accurately?), coverage (is critical information missing?), and source quality (are the sources trustworthy?). Research agents are the hardest type to evaluate because ground truth often doesn't exist.

Computer Use Agents

Verify UI state changes via screenshots or DOM inspection. Must also evaluate backend outcomes (was the action actually executed?) — not just visual state.

Implementation Roadmap for Teams

Week 1–2: Bootstrap

Start with 20–50 test cases sourced from real production failures rather than waiting for a comprehensive test suite. Each case needs a reference solution and clear success criteria.
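
One such case could be recorded as a small structured object, as in this sketch (field names and example content are illustrative, not a framework schema).

from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentTestCase:
    case_id: str
    task: str                      # the prompt or task that failed in production
    reference_solution: str        # what a correct run should have produced
    success_criteria: List[str] = field(default_factory=list)

case = AgentTestCase(
    case_id="prod-failure-017",
    task="Refund order #4521 and notify the customer",
    reference_solution="Refund issued via the payments tool; one confirmation email sent",
    success_criteria=["refund tool called with the correct order id", "exactly one notification sent"],
)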

Week 3–4: Automate

Set up CI/CD integration: run evals automatically on every commit. Combine code-based graders (fast, objective) with 1–2 LLM-judge metrics (flexible). Target agreement ≥ 0.80 with human evaluation.

Week 5–8: Production

Deploy continuous evaluation: LLM-judge on 5–10% of production traffic, lightweight checks on 100%. Set up alerting when metrics drop. Implement canary deployment gates.

Ongoing: Iterate

Review the eval suite monthly and add cases from new failure modes. Monitor eval saturation — once the agent passes 99% of cases consistently, the eval is no longer discriminating and its difficulty should be increased.

Anti-patterns to Avoid

Don't Make These Mistakes

  • Only measuring outcomes, ignoring trajectory: You know the agent fails but can't tell why → can't fix it
  • Grading steps instead of outputs: Rejecting creative but valid paths that still produce correct results
  • Evaluating on synthetic data only: The 37% gap between lab and production is real — use real failure data
  • Trusting LLM-Judge without validation: Running a judge without calibrating vs human = false confidence
  • Single benchmark reliance: Every benchmark has blind spots — use 2–4 complementary ones

Conclusion

AI Agent evaluation isn't a "nice-to-have" — it's a hard requirement for production deployment. 2026 is the year every AI team must invest seriously in evaluation, reliability, and optimization. Start small with 20 test cases from real failures, progressively automate with LLM-as-Judge, and scale into a continuous evaluation pipeline. Combine trajectory + outcome metrics, validate judges against human correlation, and establish progressive gates for deployment.

Good evaluation doesn't just help you ship better agents — it gives you the confidence to ship faster.
