AI Agent Observability 2026: How Do You Know Your Agent Works?

Posted on: 6/4/2026 1:13:22 AM

Table of contents

1. Why evaluating an Agent is harder than evaluating a Model
   - The most common mistake
2. The 2026 shift: from "right answer" to "right trajectory"
3. Anatomy of a Trace: the agent span tree
4. OpenTelemetry GenAI: standardize to avoid vendor lock-in
5. The five-layer observability stack
6. The metrics that actually matter
   - The golden rule
7. LLM-as-Judge: using a model to grade a model
   - Tip: judge online and offline with the same rubric
8. Offline Eval and Online Monitoring: two sides of one coin
9. Tool comparison: which platform to pick?
   - A concise framing
10. Getting hands-on: a real instrumentation example
11. Common traps
12. Conclusion
   - References

You spend two weeks building an AI agent. The demo is flawless: it reads the email, looks up the order, calls the right API, answers coherently. Your boss nods, the project ships to production. Three weeks later a customer complains that the agent confidently described a refund policy that does not exist. You open the logs and find an endless ocean of JSON. The simple question — "what happened?" — has become nearly impossible to answer.

This is exactly the problem AI agent observability was born to solve. In traditional software, a line of code that runs correctly today runs correctly tomorrow. With agents, the same question can travel five different paths, call different tools, and produce different results on every run. This article dissects why evaluating agents is so hard, and how the 2026 toolkit — OpenTelemetry GenAI tracing, trajectory-level evaluation, and LLM-as-Judge — hands you back control.

~70% of Agentic AI projects fail on missing evaluation & monitoring, not on a weak model

5 layers in a complete agent observability stack (2026)

74.3% success rate of the best agent on WebArena, still below the human bar (78%)

<15% overhead added by tracing SDKs (Langfuse, AgentOps) in production

1. Why evaluating an Agent is harder than evaluating a Model

Most engineers come to agents from the machine-learning world, where evaluation is largely solved: you have a test set, a metric (accuracy, F1, BLEU...), run it once, get a number. Agents break every one of those assumptions, for four reasons.

Non-determinism: The same input can produce two different action sequences across two runs. There is no single "golden answer".
Multi-step: A task may span 3, 10 or 40 steps. The failure rarely lives in the "final answer"; it usually lives somewhere in the middle, say at step 7, where the agent calls the wrong tool, tries to fix it, and wanders off.
The path matters as much as the result: An agent can reach the right answer but take 25 wasted steps, burn 4x the tokens, and call an expensive API three times. "Correct" but costly and slow is still an operational failure.
The tool and memory layers: An agent doesn't just emit text — it calls functions, queries vector DBs, reads and writes memory. Each link is a potential point of failure, and they interact in unpredictable ways.
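One way to make the non-determinism point measurable is to run the same task several times and count how many distinct tool sequences come back. The sketch below assumes a hypothetical `run_agent` callable that returns the ordered list of tool names used in one run; the stub agent and its tool names are illustrative only.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


def trajectory_spread(run_agent: Callable[[str], Sequence[str]], task: str, runs: int = 5) -> Counter:
    """Run the same task several times and count each distinct tool sequence."""
    return Counter(tuple(run_agent(task)) for _ in range(runs))


# Stub agent (hypothetical): replays a fixed cycle of tool sequences so the
# example is repeatable. A real agent would call the model here.
_paths = itertools.cycle([
    ["lookup_order", "process_refund"],
    ["lookup_order", "process_refund"],
    ["search_kb", "lookup_order", "process_refund"],
])


def fake_agent(task: str) -> Sequence[str]:
    return next(_paths)


spread = trajectory_spread(fake_agent, "Cancel order #4821 and refund", runs=6)
for path, count in spread.most_common():
    print(count, " -> ".join(path))
```

More than one key in the counter means there is no single path to assert against, which is why the rest of this article grades properties of the path rather than an exact match.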

The most common mistake

Treating the agent as a "text-emitting black box" and scoring only the final answer. An agent can be right by luck, or wrong after a flawless reasoning chain that breaks on the last step. If you only look at the output, you can never tell those two cases apart — and you can never fix the root cause.

2. The 2026 shift: from "right answer" to "right trajectory"

This is the most important mindset change of the year. In 2024–2025, most teams still scored agents with single-turn metrics: "does the answer match ground truth?". By 2026, the unit of evaluation has become the trajectory — the entire path the agent took.

Instead of asking "did the model answer correctly?", the real operational question has become: "which step failed, under which tool call, with which prompt version, which retrieval context, at what latency and cost?". You grade which tools the agent picked, whether it recovered from a failed call, and how many wasted steps it took.

flowchart LR
    A["Single-turn eval<br/>(2024-2025)"] -->|"scores only<br/>final output"| B["Pass / Fail"]
    C["Trajectory eval<br/>(2026)"] -->|"scores the<br/>whole path"| D["Tool selection"]
    C --> E["Recovery ability"]
    C --> F["Wasted steps"]
    C --> G["Cost & latency<br/>per step"]
    C --> H["Final output"]
    style A fill:#f8f9fa,stroke:#888,color:#2c3e50
    style C fill:#e94560,stroke:#fff,color:#fff
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H fill:#f8f9fa,stroke:#e94560,color:#2c3e50

Figure 1 — The unit of evaluation shifts from "output" to "trajectory": you grade the whole journey, not just the destination.
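The trajectory dimensions in Figure 1 can be computed from a list of recorded steps. This is a minimal sketch, not a standard scoring scheme: the `Step` fields, in particular the `needed` label saying whether a step belongs to a reference path, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    ok: bool       # did the tool call succeed?
    needed: bool   # hypothetical label: is this step on the reference path?


def score_trajectory(steps: list[Step]) -> dict:
    total = len(steps)
    wasted = sum(1 for s in steps if not s.needed)
    failures = [i for i, s in enumerate(steps) if not s.ok]
    # A failed call counts as recovered if a later call to the same tool succeeds
    recovered = sum(
        1 for i in failures
        if any(later.tool == steps[i].tool and later.ok for later in steps[i + 1:])
    )
    return {
        "steps": total,
        "wasted_steps": wasted,
        "tool_selection_accuracy": (total - wasted) / total if total else 0.0,
        "failed_calls": len(failures),
        "recovery_rate": recovered / len(failures) if failures else 1.0,
    }


trajectory = [
    Step("lookup_order", ok=True, needed=True),
    Step("search_kb", ok=True, needed=False),       # detour
    Step("process_refund", ok=False, needed=True),  # wrong arguments
    Step("process_refund", ok=True, needed=True),   # retry succeeds
]
report = score_trajectory(trajectory)
print(report)
```

A pass/fail check on the final output would treat this run as a clean success; the report shows one detour and one failed call that was recovered.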

3. Anatomy of a Trace: the agent span tree

The foundation of all observability is the trace. A trace describes one complete agent request, organized as a tree of spans: each span is a unit of work with a start/end time, attributes, and parent-child relationships. The model is borrowed directly from distributed tracing (Jaeger, Zipkin) and extended with LLM semantics.

flowchart TD
    R["TRACE: 'Cancel order #4821 and refund'"] --> S1["Span: agent.invoke<br/>(root, 8.2s, $0.04)"]
    S1 --> S2["Span: llm.chat (turn 1)<br/>decide to call tool"]
    S1 --> S3["Span: tool.lookup_order<br/>(120ms)"]
    S1 --> S4["Span: llm.chat (turn 2)<br/>read result, plan"]
    S1 --> S5["Span: tool.process_refund<br/>(340ms) WRONG ARGS"]
    S1 --> S6["Span: llm.chat (turn 3)<br/>self-correct & retry"]
    S1 --> S7["Span: tool.process_refund<br/>(310ms) OK"]
    S1 --> S8["Span: llm.chat (turn 4)<br/>compose answer"]
    style R fill:#2c3e50,stroke:#fff,color:#fff
    style S1 fill:#e94560,stroke:#fff,color:#fff
    style S5 fill:#fff3e0,stroke:#ff9800,color:#2c3e50
    style S2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S4 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S6 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S7 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S8 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

Figure 2 — A trace of one task. From the span tree you immediately see the agent passed wrong arguments on the first refund call, caught it itself, and retried. The "correct" final output completely hides this incident.

Each span carries attributes, and the detail in them is where the value lies. An llm.chat span records the model, prompt, completion, input/output token counts, cost, latency, and temperature. A tool.* span records the tool name, input arguments, returned result, and any error. When every span is labeled consistently, you can query a million traces to answer questions like "which tool fails most often?" or "which prompt version increases wasted steps?".
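As an illustration of the "which tool fails most often?" query, the sketch below aggregates over spans that have already been exported as plain dictionaries. The dictionary shape (`name`, `error`) is an assumption for the example; a real backend would expose its own query language or API.

```python
from collections import defaultdict


def tool_failure_rates(spans: list[dict]) -> dict[str, float]:
    """Return the error rate per tool, computed over spans named 'tool.<name>'."""
    calls: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for span in spans:
        if not span["name"].startswith("tool."):
            continue  # skip llm.chat and agent spans
        tool = span["name"][len("tool."):]
        calls[tool] += 1
        if span.get("error"):
            errors[tool] += 1
    return {tool: errors[tool] / calls[tool] for tool in calls}


# Spans loosely modeled on the trace in Figure 2
spans = [
    {"name": "llm.chat"},
    {"name": "tool.lookup_order"},
    {"name": "tool.process_refund", "error": "invalid arguments"},
    {"name": "tool.process_refund"},
]
rates = tool_failure_rates(spans)
print(rates)
```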

4. OpenTelemetry GenAI: standardize to avoid vendor lock-in

The problem in the early days was that every platform defined spans its own way. Your traces were locked tightly into LangSmith, or Langfuse, or some proprietary SDK. OpenTelemetry GenAI Semantic Conventions exist to fix this: one standard, vendor-neutral set of span schemas and attribute names so every tool "speaks the same language".

2023 — 2024

The chaos era: each platform (LangSmith, Langfuse, Arize, Weights & Biases) defined its own trace structure. Switching vendors meant rewriting all your instrumentation.

2024

OpenTelemetry launches the GenAI Semantic Conventions working group, defining gen_ai.* attributes for LLM client spans: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens...

Late 2025

Expansion from "LLM call" to "agent": adding the notions of agent span, tool execution span, and events to capture prompt/completion content.

Early 2026

The LLM client span group exits "experimental" and reaches stability. Datadog, New Relic and Dynatrace support the GenAI conventions natively — agent observability enters a real era of standardization.

The practical payoff is huge: you instrument the agent once against the OTel standard, then freely export traces to any backend — from self-hosted Jaeger, to Langfuse, to Datadog — without changing a line of instrumentation code. This is the same philosophy that made OpenTelemetry a success in the microservices world, now applied to agents.

Span type	Typical attributes (gen_ai.*)	Answers what question?
LLM client span	system (provider.name in newer spec revisions), request.model, usage.input_tokens, usage.output_tokens, response.finish_reasons	How many tokens did this model call cost, which model, why did it stop?
Agent span	agent.name, agent.id, operation.name (invoke_agent)	Which agent, which version is running this task?
Tool / execute span	tool.name, tool.call.arguments, tool.call.result	Which tool was called, with what args, returning what, with what error?
Events	gen_ai.system.message, gen_ai.user.message, gen_ai.choice	Full prompt/completion content for replay and re-evaluation.
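To make the LLM client row concrete, here is a small sketch that builds the attribute mapping for one model call as a plain dictionary. The helper name and the example values are hypothetical, and attribute names have changed between revisions of the conventions, so check them against the version your backend expects.

```python
def llm_span_attributes(system: str, model: str, input_tokens: int,
                        output_tokens: int, finish_reasons: list[str]) -> dict:
    """Build gen_ai.* attributes for one chat call (names as used in this article)."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
    }


attrs = llm_span_attributes(
    system="example-provider",   # placeholder value
    model="example-model",       # placeholder value
    input_tokens=812,
    output_tokens=96,
    finish_reasons=["stop"],
)
print(attrs)
```

With the OpenTelemetry Python API, a mapping like this can be attached to an active span in one call via `span.set_attributes(attrs)`.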

5. The five-layer observability stack

A mature agent observability system in 2026 is organized into five layers, each addressing a distinct concern. Don't try to jump to layer 4 (evaluation) before layers 1–2 (collection & standardization) are solid.

flowchart TD
    L1["LAYER 1: SDK & Instrumentation<br/>OpenLLMetry, auto-instrument SDKs"] --> L2["LAYER 2: Standards & span schema<br/>OpenTelemetry GenAI Conventions"]
    L2 --> L3["LAYER 3: Tracing & Replay<br/>span tree, time-travel, step debugging"]
    L3 --> L4["LAYER 4: Evaluation & Scoring<br/>LLM-as-Judge, small-model judge, Ragas"]
    L4 --> L5["LAYER 5: Cost & Operations<br/>tokens, $, latency, alerts, dashboards"]
    style L1 fill:#16213e,stroke:#e94560,color:#fff
    style L2 fill:#1f4068,stroke:#e94560,color:#fff
    style L3 fill:#2c3e50,stroke:#e94560,color:#fff
    style L4 fill:#e94560,stroke:#fff,color:#fff
    style L5 fill:#4CAF50,stroke:#fff,color:#fff

Figure 3 — The five layers of agent observability. The more solid the lower layers, the more trustworthy the upper ones.

6. The metrics that actually matter

Once you have traces, the question is: what do you score? Below is the core metric set that 2026 production teams track. Note that it covers three dimensions: outcome, behavior, and operations.

Metric	What it measures	Why it matters
Task Success Rate	Did the agent complete the user's actual goal	The ultimate bottom-line metric, but on its own it never tells you why a task failed.
Tool Selection Accuracy	Did the agent call the right tool with the right args at each step	An agent can call every tool correctly and still fail the task — and vice versa.
Trajectory Quality	Wasted steps, loops, recovery after failure	Catches agents that "wander" or get stuck in loops even when they eventually answer.
Faithfulness / Hallucination	Does the answer stay grounded in the provided context/data	Scored by LLM-as-Judge; catches confidently fabricated answers.
Cost per Step	Tokens and dollars per step, per tool, per task	"Correct" but burning 5x the budget is still an operational failure.
Latency per Tool	Latency and error rate of each tool across the trace tree	Pinpoints bottlenecks — usually a slow external API, not the model.

The golden rule

Never optimize a single metric in isolation. An agent that pushes Task Success to 95% by calling every available tool at every step will wreck Cost per Step. Good observability means looking at the whole table at once and understanding the trade-offs.
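A simple way to look at the whole table at once is to compare every metric of a candidate build against a baseline and list everything that got worse. The metric names, the 5% tolerance, and the numbers below are illustrative assumptions, not recommended values.

```python
# True means higher is better, False means lower is better
METRIC_DIRECTION = {
    "task_success_rate": True,
    "tool_selection_accuracy": True,
    "cost_per_step_usd": False,
    "p95_latency_s": False,
}


def regressions(baseline: dict, candidate: dict, tolerance: float = 0.05) -> list[str]:
    """Return the metrics that worsened by more than `tolerance` (relative change)."""
    worse = []
    for metric, higher_is_better in METRIC_DIRECTION.items():
        old, new = baseline[metric], candidate[metric]
        change = (new - old) / old if old else 0.0
        worsened_by = -change if higher_is_better else change
        if worsened_by > tolerance:
            worse.append(metric)
    return worse


baseline = {"task_success_rate": 0.88, "tool_selection_accuracy": 0.90,
            "cost_per_step_usd": 0.004, "p95_latency_s": 2.0}
candidate = {"task_success_rate": 0.95, "tool_selection_accuracy": 0.91,
             "cost_per_step_usd": 0.012, "p95_latency_s": 2.05}

flagged = regressions(baseline, candidate)
print(flagged)
```

Here the candidate improves task success but triples the cost per step, which is the trade-off a single-metric dashboard would hide.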

7. LLM-as-Judge: using a model to grade a model

For things with no "golden answer" (answer quality, grounding, tone) you can't write a simple assert. The dominant 2026 solution is LLM-as-Judge: use a second LLM (often a stronger model, or a small specialized one) to score outputs against a rubric you define.

import json

# LLM-as-Judge: score faithfulness on a 1-5 scale
JUDGE_PROMPT = """You are a judge evaluating an AI agent's answer.
Score ONLY based on the provided CONTEXT, do NOT use outside knowledge.

[CONTEXT]
{context}

[QUESTION]
{question}

[AGENT ANSWER]
{answer}

Score faithfulness from 1 (fully fabricated) to 5 (tightly grounded).
Return JSON: {{"score": <1-5>, "reason": "<one sentence>"}}"""

def judge_faithfulness(context, question, answer):
    # judge_client is a placeholder for your LLM SDK's client object; adapt the
    # call and the response field to the SDK you actually use.
    resp = judge_client.chat(
        model="claude-haiku-4-5",          # small, cheap, good enough to judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,                      # minimizes run-to-run variation
    )
    verdict = json.loads(resp.content)
    # Reject malformed verdicts instead of silently logging a bad score
    if not 1 <= int(verdict["score"]) <= 5:
        raise ValueError(f"judge returned out-of-range score: {verdict!r}")
    return verdict

Two survival rules when using LLM-as-Judge:

The judge needs evaluating too. Before trusting a judge, calibrate its scores against a small human-graded set. A miscalibrated judge is more dangerous than no judge, because it gives you false confidence.
Tier it to save cost. The common 2026 pattern combines rule-based checks for what code can verify (is the JSON valid, does the tool exist) with small-model judges for the semantic part — achieving 100% production coverage at acceptable cost, instead of invoking a large model for every trace.
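The tiering rule can be sketched as a two-stage scorer: cheap deterministic checks run first, and the model-based judge is only called when they pass. The tool registry and the `semantic_judge` callable are hypothetical stand-ins; in practice the callable would wrap a small judge model.

```python
import json
from typing import Callable, Optional

KNOWN_TOOLS = {"lookup_order", "process_refund"}  # hypothetical tool registry


def rule_checks(tool_name: str, raw_args: str) -> Optional[str]:
    """Return a failure reason, or None when the cheap checks pass."""
    if tool_name not in KNOWN_TOOLS:
        return f"unknown tool: {tool_name}"
    try:
        json.loads(raw_args)
    except json.JSONDecodeError:
        return "tool arguments are not valid JSON"
    return None


def tiered_score(tool_name: str, raw_args: str, answer: str,
                 semantic_judge: Callable[[str], int]) -> dict:
    reason = rule_checks(tool_name, raw_args)
    if reason is not None:
        # Tier 1 failed: no model call is needed to know this trace is bad
        return {"score": 1, "tier": "rules", "reason": reason}
    # Tier 2: only traces that pass the rules reach the model-based judge
    return {"score": semantic_judge(answer), "tier": "judge", "reason": ""}


judge_calls = []


def stub_judge(answer: str) -> int:
    judge_calls.append(answer)
    return 4  # fixed score, standing in for a real model verdict


bad = tiered_score("delete_account", "{}", "Done.", stub_judge)
good = tiered_score("process_refund", '{"order_id": 4821}', "Refund issued.", stub_judge)
print(bad, good, len(judge_calls))
```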

Tip: judge online and offline with the same rubric

Use the exact same judge for both offline eval (on a dataset, in CI) and online monitoring (sampling production traces). When they share a rubric, offline and online scores become comparable, so you can tell quickly whether quality drops once the agent hits real traffic.
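Before a shared judge is trusted in either setting, its scores can be compared with a small human-graded set, as the first survival rule suggests. The sketch below reports three simple agreement figures on a 1-5 scale; which figures to use and what counts as good enough are assumptions left to the team, and the sample scores are made up.

```python
def calibration_report(human: list[int], judge: list[int]) -> dict:
    """Compare judge scores with human scores for the same items."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [abs(h - j) for h, j in zip(human, judge)]
    n = len(diffs)
    return {
        "exact_agreement": sum(d == 0 for d in diffs) / n,
        "within_one_point": sum(d <= 1 for d in diffs) / n,
        "mean_abs_error": sum(diffs) / n,
    }


human_scores = [5, 4, 2, 1, 3]
judge_scores = [5, 3, 2, 3, 3]
calibration = calibration_report(human_scores, judge_scores)
print(calibration)
```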

8. Offline Eval and Online Monitoring: two sides of one coin

Don't conflate these two. Offline eval grades the agent on a fixed dataset before deploy — like unit tests, run in CI, blocking the merge if the score drops. Online monitoring observes the agent on real traffic after deploy — sampling traces, scoring with a judge, firing alerts when a metric crosses a threshold.

flowchart LR
    subgraph OFF["OFFLINE — pre-deploy"]
        D["Golden dataset<br/>(cases + expectations)"] --> RUN["Run agent"]
        RUN --> SC["Score<br/>(judge + rules)"]
        SC --> GATE{"Score >= threshold?"}
        GATE -->|"No"| BLOCK["Block merge"]
        GATE -->|"Yes"| SHIP["Deploy"]
    end
    SHIP --> PROD["PRODUCTION"]
    subgraph ON["ONLINE — post-deploy"]
        PROD --> TR["Real traces"]
        TR --> SMP["Sample & score"]
        SMP --> ALERT{"Metric drifting?"}
        ALERT -->|"Yes"| PAGE["Alert + investigate"]
        ALERT -->|"No"| OK["Continue"]
    end
    PAGE -.->|"failed case becomes<br/>a new test"| D
    style OFF fill:#f8f9fa,stroke:#e94560
    style ON fill:#f8f9fa,stroke:#4CAF50
    style PROD fill:#2c3e50,stroke:#fff,color:#fff
    style GATE fill:#e94560,stroke:#fff,color:#fff
    style ALERT fill:#4CAF50,stroke:#fff,color:#fff

Figure 4 — The closed loop: a failure caught in production becomes a new case in the offline dataset. This is how your test suite grows over time.

The dashed arrow is what most teams forget: every production incident should be turned into a new test case in the offline dataset. Without this feedback loop, you'll patch the same class of bug forever.
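Both halves of the loop fit in a few lines. The sketch below shows an offline gate that compares the mean score with a threshold, and a helper that turns a failed production trace into a new dataset case. The dataset shape, the trace fields, and the threshold of 4.0 are hypothetical choices for illustration.

```python
from statistics import mean
from typing import Callable


def offline_gate(cases: list[dict], run_agent: Callable[[str], str],
                 score: Callable[[dict, str], float], threshold: float = 4.0):
    """Return (passed, mean_score) for the dataset; CI blocks the merge when passed is False."""
    scores = [score(case, run_agent(case["input"])) for case in cases]
    avg = mean(scores)
    return avg >= threshold, avg


def add_regression_case(dataset: list[dict], failed_trace: dict) -> list[dict]:
    """Append a production failure to the dataset unless its input is already covered."""
    if failed_trace["input"] not in {case["input"] for case in dataset}:
        dataset.append({
            "input": failed_trace["input"],
            "expected": failed_trace["corrected_output"],
            "source": "production",
        })
    return dataset


dataset = [
    {"input": "status of order 1", "expected": "shipped"},
    {"input": "status of order 2", "expected": "delivered"},
]


def stub_agent(task: str) -> str:
    return "shipped"  # stand-in for a real agent run


def exact_match(case: dict, output: str) -> float:
    return 5.0 if output == case["expected"] else 1.0


passed, avg = offline_gate(dataset, stub_agent, exact_match)
trace = {"input": "refund policy for order 3", "corrected_output": "no refund after 30 days"}
add_regression_case(dataset, trace)
add_regression_case(dataset, trace)  # second call is a no-op
print(passed, avg, len(dataset))
```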

9. Tool comparison: which platform to pick?

The 2026 ecosystem has matured. There is no absolute "best" tool — only the one that fits your stack and constraints. The four most worth considering:

Tool	Strength	Best when
Langfuse	Open-source, strong self-host, OTel-native, keeps traces in your infra (ClickHouse-backed). ~15% overhead.	You need data residency, want self-hosting, or want to decouple from a specific framework.
LangSmith	Deep LangChain/LangGraph integration, near-zero overhead, clusters traces into "Insights".	Your stack is already built on LangChain/LangGraph and you want a seamless experience.
AgentOps	Strong at replay and multi-framework debugging, reconstructs each step of a session. ~12% overhead.	You run several different agent frameworks and need to "rewind" to debug.
Arize Phoenix	Open-source, strong on eval and drift detection, OTel-native at its core.	You prioritize evaluation & quality monitoring over time.

A concise framing

One neat summary of their roles: Langfuse gives you the traces, LangSmith clusters them into insights, Braintrust (not covered in the table above) lets you build eval datasets from them, and AgentOps lets you replay them. Many mature teams mix and match, and thanks to the OpenTelemetry GenAI standard the switching cost between them is far lower than before.

10. Getting hands-on: a real instrumentation example

The best part of the OTel standard is that you often don't write spans by hand: for many frameworks you just attach auto-instrumentation. When you do need manual spans, for example around a custom agent loop, they look like the minimal OpenTelemetry-style Python example below, which exports traces to any OTLP-compatible backend.

import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# 1. Configure the provider, export to an OTLP-compatible backend (Langfuse, Jaeger...)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel.your-backend.io/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")

# 2. Wrap an agent run in a root span, set attributes per the gen_ai.* convention
def run_agent(task: str):
    result = None
    with tracer.start_as_current_span("agent.invoke") as root:
        root.set_attribute("gen_ai.agent.name", "support-agent")
        root.set_attribute("gen_ai.operation.name", "invoke_agent")
        # agent_loop is your own planner loop (not shown); each step it yields is
        # assumed to expose .tool, .args and .execute()
        for step in agent_loop(task):
            # 3. Each tool call is a child span; an exception raised inside the
            #    block is recorded on the span and marks it as an error
            with tracer.start_as_current_span(f"tool.{step.tool}") as ts:
                ts.set_attribute("gen_ai.operation.name", "execute_tool")
                ts.set_attribute("gen_ai.tool.name", step.tool)
                ts.set_attribute("gen_ai.tool.call.arguments", json.dumps(step.args))
                result = step.execute()
                # Truncate large results to keep span payloads small
                ts.set_attribute("gen_ai.tool.call.result", str(result)[:2000])
    # Return the last tool result rather than the span, which has already ended
    return result

On .NET, the ecosystem has caught up too: System.Diagnostics.Activity maps directly onto an OpenTelemetry span (with ActivitySource in the role of the tracer), and libraries like Microsoft.Extensions.AI emit telemetry following the gen_ai.* convention. An agent written in .NET 10 can therefore export traces in the same format as a Python agent, into the same dashboard.

11. Common traps

Logs instead of traces. Dumping an ocean of print() into a file is not observability. Without parent-child relationships between steps, you can never reconstruct the trajectory.
Trusting an uncalibrated judge. An LLM-as-Judge must be calibrated against human grading before it gates CI, otherwise you're blocking merges based on a model's random opinion.
Ignoring cost. Tracking accuracy without tracking cost per step is a recipe for a shocking API bill at month's end.
Not sampling in production. Judging 100% of traces with a large LLM-as-Judge on high traffic can cost as much as running the agent itself. Sample smartly: prioritize traces with errors, high latency, or abnormal cost.
A broken feedback loop. If you catch a production failure but never turn it into an offline test case, you'll meet that exact failure again.
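The sampling advice in the list above can be expressed as a small decision function: always judge suspicious traces and take a random slice of the rest. The trace fields and all thresholds here are illustrative assumptions, not recommended values.

```python
import random


def should_judge(trace: dict, rng: random.Random, base_rate: float = 0.05,
                 latency_limit_s: float = 10.0, cost_limit_usd: float = 0.10) -> bool:
    """Decide whether a production trace is sent to the LLM judge."""
    suspicious = (
        bool(trace.get("error"))
        or trace["latency_s"] > latency_limit_s
        or trace["cost_usd"] > cost_limit_usd
    )
    if suspicious:
        return True  # errors, slow runs and expensive runs are always judged
    return rng.random() < base_rate  # random sample of the healthy traffic


rng = random.Random(0)  # seeded only to make the example repeatable
failed = {"error": "tool timeout", "latency_s": 3.1, "cost_usd": 0.02}
slow = {"latency_s": 14.0, "cost_usd": 0.02}
healthy = {"latency_s": 2.0, "cost_usd": 0.01}

print(should_judge(failed, rng), should_judge(slow, rng))
```

Keeping the random slice matters: judging only the suspicious traces would hide quality problems in runs that look healthy on latency and cost.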

12. Conclusion

In 2024 the question was "which model is strongest?". In 2026, when every team has access to roughly comparable models, the deciding question has changed: "can you see what your agent is doing?". The teams that trace every step, score trajectories rather than just outputs, and close the loop from production incident back to dataset — those teams will operate reliable agents, while everyone else is still guessing across an ocean of JSON.

Observability is not something you bolt on last "if there's time". In the agentic era, it is the difference between an impressive demo and a production system you'd dare put in front of a customer. Instrument against the OpenTelemetry GenAI standard from your very first line of code — your future self will thank you.

References

#AI Agents #Agentic AI #Observability #LLMOps #OpenTelemetry #Evaluation

# AI Agent Observability 2026: How Do You Know Your Agent Works?

You spend two weeks building an AI agent. The demo is flawless: it reads the email, looks up the order, calls the right API, answers coherently. Your boss nods, the project ships to production. Three weeks later a customer complains that the agent confidently described a refund policy that *does not exist*. You open the logs and find an endless ocean of JSON. The simple question — **"what happened?"** — has become nearly impossible to answer.

This is exactly the problem **AI agent observability** was born to solve. In traditional software, a line of code that runs correctly today runs correctly tomorrow. With agents, the same question can travel five different paths, call different tools, and produce different results on every run. This article dissects why evaluating agents is so hard, and how the 2026 toolkit — **OpenTelemetry GenAI tracing, trajectory-level evaluation, and LLM-as-Judge** — hands you back control.

~70%of Agentic AI projects fail on missing evaluation & monitoring, not on a weak model

5 layersin a complete agent observability stack (2026)

74.3%success rate of the best agent on WebArena — still below the human bar (78%)

<15%overhead added by tracing SDKs (Langfuse, AgentOps) in production

## 1. Why evaluating an Agent is harder than evaluating a Model

- **Non-determinism:** The same input can produce two different action sequences across two runs. There is no single "golden answer".
- **Multi-step:** A task may span 3, 10 or 40 steps. The failure rarely lives in the "final answer" — it usually lives at step 7, where the agent calls the wrong tool, tries to fix it, and wanders off.
- **The path matters as much as the result:** An agent can reach the right answer but take 25 wasted steps, burn 4x the tokens, and call an expensive API three times. "Correct" but costly and slow is still an operational failure.
- **The tool and memory layers:** An agent doesn't just emit text — it calls functions, queries vector DBs, reads and writes memory. Each link is a potential point of failure, and they interact in unpredictable ways.

#### The most common mistake

Treating the agent as a "text-emitting black box" and scoring only the final answer. An agent can be right by *luck*, or wrong after a flawless reasoning chain that breaks on the last step. If you only look at the output, you can never tell those two cases apart — and you can never fix the root cause.

## 2. The 2026 shift: from "right answer" to "right trajectory"

Instead of asking *"did the model answer correctly?"*, the real operational question has become: *"which step failed, under which tool call, with which prompt version, which retrieval context, at what latency and cost?"*. You grade which tools the agent picked, whether it recovered from a failed call, and how many wasted steps it took.

```
flowchart LR
    A["Single-turn eval  
(2024-2025)"] -->|"scores only  
final output"| B["Pass / Fail"]
    C["Trajectory eval  
(2026)"] -->|"scores the  
whole path"| D["Tool selection"]
    C --> E["Recovery ability"]
    C --> F["Wasted steps"]
    C --> G["Cost & latency  
per step"]
    C --> H["Final output"]
    style A fill:#f8f9fa,stroke:#888,color:#2c3e50
    style C fill:#e94560,stroke:#fff,color:#fff
    style B fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style D fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style E fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style F fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style G fill:#f8f9fa,stroke:#e94560,color:#2c3e50
    style H fill:#f8f9fa,stroke:#e94560,color:#2c3e50

```

Figure 1 — The unit of evaluation shifts from "output" to "trajectory": you grade the whole journey, not just the destination.

## 3. Anatomy of a Trace: the agent span tree

The foundation of all observability is the **trace**. A trace describes one complete agent request, organized as a tree of **spans** — each span is a unit of work with a start/end time, attributes, and parent-child relationships. Borrowed directly from distributed tracing (Jaeger, Zipkin), but extended for LLM semantics.

```
flowchart TD
    R["TRACE: 'Cancel order #4821 and refund'"] --> S1["Span: agent.invoke  
(root, 8.2s, $0.04)"]
    S1 --> S2["Span: llm.chat — turn 1  
decide to call tool"]
    S1 --> S3["Span: tool.lookup_order  
(120ms)"]
    S1 --> S4["Span: llm.chat — turn 2  
read result, plan"]
    S1 --> S5["Span: tool.process_refund  
(340ms)  WRONG ARGS"]
    S1 --> S6["Span: llm.chat — turn 3  
self-correct & retry"]
    S1 --> S7["Span: tool.process_refund  
(310ms)  OK"]
    S1 --> S8["Span: llm.chat — turn 4  
compose answer"]
    style R fill:#2c3e50,stroke:#fff,color:#fff
    style S1 fill:#e94560,stroke:#fff,color:#fff
    style S5 fill:#fff3e0,stroke:#ff9800,color:#2c3e50
    style S2 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S3 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S4 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S6 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S7 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50
    style S8 fill:#f8f9fa,stroke:#e0e0e0,color:#2c3e50

```

Each span carries **attributes** — and this is where the gold lives, in the details. An `llm.chat` span records the model, prompt, completion, input/output token counts, cost, latency, temperature. A `tool.*` span records the tool name, input arguments, returned result, and any error. When every span is labeled consistently, you can query a million traces to answer things like "which tool fails most often?" or "which prompt version increases wasted steps?".

## 4. OpenTelemetry GenAI: standardize to avoid vendor lock-in

The problem in the early days was that every platform defined spans its own way. Your traces were locked tightly into LangSmith, or Langfuse, or some proprietary SDK. **OpenTelemetry GenAI Semantic Conventions** exist to fix this: one standard, vendor-neutral set of span schemas and attribute names so every tool "speaks the same language".

2023 — 2024

The chaos era: each platform (LangSmith, Langfuse, Arize, Weights & Biases) defined its own trace structure. Switching vendors meant rewriting all your instrumentation.

2024

OpenTelemetry launches the GenAI Semantic Conventions working group, defining `gen_ai.*` attributes for LLM client spans: `gen_ai.system`, `gen_ai.request.model`, `gen_ai.usage.input_tokens`...

Late 2025

Expansion from "LLM call" to "agent": adding the notions of **agent span**, **tool execution span**, and events to capture prompt/completion content.

Early 2026

The practical payoff is huge: you instrument the agent *once* against the OTel standard, then freely export traces to any backend — from self-hosted Jaeger, to Langfuse, to Datadog — without changing a line of instrumentation code. This is the same philosophy that made OpenTelemetry a success in the microservices world, now applied to agents.

| Span type | Typical attributes (gen_ai.*) | Answers what question? |
| --- | --- | --- |
| **LLM client span** | system, request.model, usage.input_tokens, usage.output_tokens, response.finish_reason | How many tokens did this model call cost, which model, why did it stop? |
| **Agent span** | agent.name, agent.id, operation.name (invoke) | Which agent, which version is running this task? |
| **Tool / execute span** | tool.name, tool.call.arguments, tool.call.result | Which tool was called, with what args, returning what, with what error? |
| **Events** | gen_ai.system.message, gen_ai.user.message, gen_ai.choice | Full prompt/completion content for replay and re-evaluation. |

## 5. The five-layer observability stack

```
flowchart TD
    L1["LAYER 1 — SDK & Instrumentation  
OpenLLMetry, auto-instrument SDKs"] --> L2["LAYER 2 — Standards & span schema  
OpenTelemetry GenAI Conventions"]
    L2 --> L3["LAYER 3 — Tracing & Replay  
span tree, time-travel, step debugging"]
    L3 --> L4["LAYER 4 — Evaluation & Scoring  
LLM-as-Judge, small-model judge, Ragas"]
    L4 --> L5["LAYER 5 — Cost & Operations  
tokens, $, latency, alerts, dashboards"]
    style L1 fill:#16213e,stroke:#e94560,color:#fff
    style L2 fill:#1f4068,stroke:#e94560,color:#fff
    style L3 fill:#2c3e50,stroke:#e94560,color:#fff
    style L4 fill:#e94560,stroke:#fff,color:#fff
    style L5 fill:#4CAF50,stroke:#fff,color:#fff

```

Figure 3 — The five layers of agent observability. The more solid the lower layers, the more trustworthy the upper ones.

## 6. The metrics that actually matter

Once you have traces, the question is: what do you score? Below is the core metric set that 2026 production teams track. Note it spans all three layers: outcome, behavior, and operations.

| Metric | What it measures | Why it matters |
| --- | --- | --- |
| **Task Success Rate** | Did the agent complete the user's actual goal | The ultimate "bridge" metric — but on its own it never tells you *why* it failed. |
| **Tool Selection Accuracy** | Did the agent call the right tool with the right args at each step | An agent can call every tool correctly and still fail the task — and vice versa. |
| **Trajectory Quality** | Wasted steps, loops, recovery after failure | Catches agents that "wander" or get stuck in loops even when they eventually answer. |
| **Faithfulness / Hallucination** | Does the answer stay grounded in the provided context/data | Scored by LLM-as-Judge; catches confidently fabricated answers. |
| **Cost per Step** | Tokens and dollars per step, per tool, per task | "Correct" but burning 5x the budget is still an operational failure. |
| **Latency per Tool** | Latency and error rate of each tool across the trace tree | Pinpoints bottlenecks — usually a slow external API, not the model. |

#### The golden rule

Never optimize a single metric in isolation. An agent that pushes Task Success to 95% by calling every available tool at every step will wreck Cost per Step. Good observability means looking at **the whole table at once** and understanding the trade-offs.

## 7. LLM-as-Judge: using a model to grade a model

For things with no "golden answer" — answer quality, grounding, tone — you can't write a simple `assert`. The dominant 2026 solution is **LLM-as-Judge**: use an LLM itself (often a stronger model, or a small specialized one) to score outputs against a rubric you define.

```python
# LLM-as-Judge: score faithfulness on a 1-5 scale
JUDGE_PROMPT = """You are a judge evaluating an AI agent's answer.
Score ONLY based on the provided CONTEXT, do NOT use outside knowledge.

[CONTEXT]
{context}

[QUESTION]
{question}

[AGENT ANSWER]
{answer}

Score faithfulness from 1 (fully fabricated) to 5 (tightly grounded).
Return JSON: {{"score": <1-5>, "reason": ""}}"""

def judge_faithfulness(context, question, answer):
    resp = judge_client.chat(
        model="claude-haiku-4-5",          # small, cheap, good enough to judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,                      # deterministic for reproducibility
    )
    return json.loads(resp.content)

Two survival rules when using LLM-as-Judge:

- The judge needs evaluating too. Before trusting a judge, calibrate its scores against a small human-graded set. A miscalibrated judge is more dangerous than no judge, because it gives you false confidence.
- Tier it to save cost. The common 2026 pattern combines rule-based checks for what code can verify (is the JSON valid, does the tool exist) with small-model judges for the semantic part — achieving 100% production coverage at acceptable cost, instead of invoking a large model for every trace.

#### Tip: judge online and offline with the same rubric

## 8. Offline Eval and Online Monitoring: two sides of one coin

Don't conflate these two. Offline eval grades the agent on a fixed dataset before deploy — like unit tests, run in CI, blocking the merge if the score drops. Online monitoring observes the agent on real traffic after deploy — sampling traces, scoring with a judge, firing alerts when a metric crosses a threshold.

```
flowchart LR
    subgraph OFF["OFFLINE — pre-deploy"]
        D["Golden dataset<br/>(cases + expectations)"] --> RUN["Run agent"]
        RUN --> SC["Score<br/>(judge + rules)"]
        SC --> GATE{"Score >= threshold?"}
        GATE -->|"No"| BLOCK["Block merge"]
        GATE -->|"Yes"| SHIP["Deploy"]
    end
    SHIP --> PROD["PRODUCTION"]
    subgraph ON["ONLINE — post-deploy"]
        PROD --> TR["Real traces"]
        TR --> SMP["Sample & score"]
        SMP --> ALERT{"Metric drifting?"}
        ALERT -->|"Yes"| PAGE["Alert + investigate"]
        ALERT -->|"No"| OK["Continue"]
    end
    PAGE -.->|"failed case becomes<br/>a new test"| D
    style OFF fill:#f8f9fa,stroke:#e94560
    style ON fill:#f8f9fa,stroke:#4CAF50
    style PROD fill:#2c3e50,stroke:#fff,color:#fff
    style GATE fill:#e94560,stroke:#fff,color:#fff
    style ALERT fill:#4CAF50,stroke:#fff,color:#fff

```

Figure 4 — The closed loop: a failure caught in production becomes a new case in the offline dataset. This is how your test suite grows over time.
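
The dotted arrow in Figure 4 is the step that is easiest to skip, so it helps to make it a one-call operation. A minimal sketch, assuming the golden dataset is a JSONL file and a trace is a dict with `input` and `trace_id` keys; `add_regression_case` and the file layout are hypothetical. The `expected` value is meant to be written by a person during incident triage, not copied from the failing output.

```python
import hashlib
import json
from pathlib import Path

def add_regression_case(dataset_path, trace, expected):
    """Append a failed production trace to a JSONL golden dataset, once.

    Returns True if a case was added, False if the same input is already there.
    """
    path = Path(dataset_path)
    # Derive a stable id from the input so the same failure is not added twice
    case_id = hashlib.sha256(trace["input"].encode("utf-8")).hexdigest()[:12]
    existing = set()
    if path.exists():
        with path.open(encoding="utf-8") as f:
            existing = {json.loads(line)["id"] for line in f if line.strip()}
    if case_id in existing:
        return False
    case = {
        "id": case_id,
        "input": trace["input"],
        "expected": expected,
        "source_trace": trace["trace_id"],   # link back to the incident
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")
    return True
```

Keeping the originating trace id on each case makes it possible to trace a failing CI test back to the production incident that motivated it.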

## 9. Tool comparison: which platform to pick?

The 2026 ecosystem has matured. There is no absolute "best" tool — only the one that fits your stack and constraints. The four most worth considering:

| Tool | Strength | Best when |
| --- | --- | --- |
| Langfuse | Open-source, strong self-host, OTel-native, keeps traces in your infra (ClickHouse-backed). ~15% overhead. | You need data residency, want self-hosting, or want to decouple from a specific framework. |
| LangSmith | Deep LangChain/LangGraph integration, near-zero overhead, clusters traces into "Insights". | Your stack is already built on LangChain/LangGraph and you want a seamless experience. |
| AgentOps | Strong at replay and multi-framework debugging, reconstructs each step of a session. ~12% overhead. | You run several different agent frameworks and need to "rewind" to debug. |
| Arize Phoenix | Open-source, strong on eval and drift detection, OTel-native at its core. | You prioritize evaluation & quality monitoring over time. |

#### A concise framing

One neat summary of their roles: Langfuse gives you the traces, LangSmith clusters them into insights, Braintrust (an eval platform not covered in the table above) lets you build eval datasets from them, and AgentOps lets you replay them. Many mature teams actually mix and match, and thanks to the OpenTelemetry GenAI standard, the switching cost between them is far lower than before.

## 10. Getting hands-on: a real instrumentation example

```python
import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# 1. Configure the provider, export to an OTLP-compatible backend (Langfuse, Jaeger...)
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://otel.your-backend.io/v1/traces"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-agent")

# 2. Wrap an agent run in a root span, set attributes per the gen_ai.* convention
# NOTE: agent_loop() stands in for your own agent's step generator; it is not defined here.
def run_agent(task: str):
    result = None
    with tracer.start_as_current_span("agent.invoke") as root:
        root.set_attribute("gen_ai.agent.name", "support-agent")
        root.set_attribute("gen_ai.operation.name", "invoke_agent")
        for step in agent_loop(task):
            # 3. Each tool call is a child span
            with tracer.start_as_current_span(f"tool.{step.tool}") as ts:
                ts.set_attribute("gen_ai.operation.name", "execute_tool")
                ts.set_attribute("gen_ai.tool.name", step.tool)
                ts.set_attribute("gen_ai.tool.call.arguments", json.dumps(step.args))
                result = step.execute()
                # Truncate large results so a single span stays small
                ts.set_attribute("gen_ai.tool.call.result", str(result)[:2000])
    # Return the last step's result rather than the span, which has ended by now
    return result
```

On .NET, the ecosystem has caught up too: `System.Diagnostics.ActivitySource` maps directly onto OpenTelemetry spans, and libraries like `Microsoft.Extensions.AI` emit telemetry following the `gen_ai.*` convention. An agent written in .NET 10 can therefore export traces in the same format as a Python agent, into the same dashboard.

## 11. Common traps

- Logs instead of traces. Dumping an ocean of `print()` output into a file is not observability. Without parent-child relationships between steps, you can never reconstruct the trajectory.
- Trusting an uncalibrated judge. An LLM-as-Judge must be calibrated against human grading before it gates CI, otherwise you're blocking merges based on a model's random opinion.
- Ignoring cost. Tracking accuracy without tracking cost per step is a recipe for a shocking API bill at month's end.
- Not sampling in production. Judging 100% of traces with an LLM-as-Judge on high traffic can cost as much as running the agent itself. Sample smartly — prioritize traces with errors, high latency, or abnormal cost.
- A broken feedback loop. If you catch a production failure but never turn it into an offline test case, you will meet that exact failure again.
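
The "sample smartly" advice above can be written down as a small policy function. A minimal sketch, assuming each trace is a dict with optional `error`, `latency_ms` and `cost_usd` fields; `should_judge` and every threshold here are placeholders to tune against your own traffic.

```python
import random

def should_judge(trace, base_rate=0.05, latency_ms=10_000, cost_usd=0.50, rng=random):
    """Decide whether a production trace is sent to the LLM judge."""
    # Always judge traces that look suspicious
    if trace.get("error"):
        return True
    if trace.get("latency_ms", 0) > latency_ms:
        return True
    if trace.get("cost_usd", 0.0) > cost_usd:
        return True
    # Otherwise judge a small random share to keep an unbiased baseline
    return rng.random() < base_rate

traces = [
    {"id": "a", "error": "ToolTimeout"},
    {"id": "b", "latency_ms": 25_000},
    {"id": "c", "cost_usd": 1.20},
    {"id": "d", "latency_ms": 900, "cost_usd": 0.01},
]
# With base_rate=0 only the suspicious traces are selected
selected = [t["id"] for t in traces if should_judge(t, base_rate=0.0)]
```

The random share matters: judging only the suspicious traces would make the online score look worse than the traffic as a whole, so the two groups are best reported separately.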

## 12. Conclusion

In 2024 the question was "which model is strongest?". In 2026, when every team has access to roughly comparable models, the deciding question has changed: "can you see what your agent is doing?". The teams that trace every step, score trajectories rather than just outputs, and close the loop from production incident back to dataset — those teams will operate reliable agents, while everyone else is still guessing across an ocean of JSON.

---

### References

- [Anthropic — Effective context engineering for AI agents](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents)
- [OpenTelemetry — GenAI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/)
- [Confident AI — LLM Agent Evaluation Metrics in 2026](https://www.confident-ai.com/blog/llm-agent-evaluation-complete-guide)
- [AWS — Evaluating AI agents: real-world lessons from building agentic systems at Amazon](https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/)
- [Coralogix — Agentic AI Observability: A Practical Guide for 2026](https://coralogix.com/ai-blog/agentic-ai-observability/)
- [Langfuse — Langfuse vs. LangSmith for LLM Observability](https://langfuse.com/faq/all/langsmith-alternative)
