How to Evaluate AI Agents in 2026

In short

Evaluating an AI agent in 2026 means testing the boundary between the agent and the tools it calls – not just the model's reasoning. Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027 (Gartner, June 2025), mostly over escalating costs, unclear business value, and weak risk controls. The fix: golden trajectories, tool-contract tests, and replayable provenance – the last of which the EU AI Act also requires of high-risk systems.

I'll be straight with you. I've read most of the agent-evaluation guides that came out this year, and the good ones are beautifully written – and quietly solving a 2023 problem. They treat the agent as something you own: a prompt, a model, a loop you wrote yourself. Test the prompt, grade the answer, ship it.

That is not the world I keep seeing. The agents I watch break in production almost never break because the model reasoned badly.

So this is the guide I actually wanted to read: one with real numbers, with the economics nobody likes to put in writing, and with the part most engineers skip entirely – the law. Let’s get into it.

What does it mean to evaluate an AI agent in 2026?

To evaluate an agent today is to ask whether the whole system behaves – the model, the tools it calls, the data it retrieves, and the trail it leaves behind. The model is one component. Most of the failures I’ve chased live in the gaps between the others.

Here’s the reframing that finally made it click for me, from someone building exactly this kind of platform:

“In 2026 the agent isn't the system – it's the client, calling MCP endpoints across tools it doesn't own. Eval moves from 'did the model reason correctly?' to 'did the tools return what the agent expected, with provenance you can replay?' Under EU AI Act Article 13 – which applies to standalone high-risk systems from 2 December 2027 after the May 2026 Digital Omnibus – that replayable provenance is a legal requirement, not an engineering nicety.”

Anton Macius, CTO, Sphere IQ

And this isn't hand-waving. The Model Context Protocol (MCP), introduced by Anthropic in November 2024 and now stewarded by the Linux Foundation's Agentic AI Foundation, defines a literal client–server split. The agent is the client; the tools are servers it queries.

By early 2026 that ecosystem carried more than 97 million monthly SDK downloads and over 177,000 registered tools across 10,000-plus active servers.

When your agent leans on infrastructure at that scale, "just evaluate the model" stops being the right unit of analysis. You’re evaluating a supply chain.

Why do most agentic AI projects fail?

Mostly for commercial reasons, not because the model is dumb. This is the number that should make you sit up: Gartner predicts over 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls – a finding Reuters also reported, drawn from a poll of 3,412 organisations.

And it's not just agents. MIT's Project NANDA found that 95% of enterprise generative-AI pilots delivered no measurable return on profit and loss, despite an estimated $30–40 billion in spend (MIT Project NANDA, The GenAI Divide, 2025). The Financial Times put it more bluntly: 95% of organisations are getting zero return. I've sat in enough “let's call it a learning” wrap-up meetings to believe it.

The enterprise generative-AI adoption funnel from MIT Project NANDA, showing the drop-off from pilots to measurable P&L return.

The enterprise generative-AI adoption funnel. Source: MIT Project NANDA, "The GenAI Divide: State of AI in Business 2025."

McKinsey hits the same wall, just from the scaling side. In a survey of 1,993 organisations, 23% said they'd scaled an agentic system somewhere in the business – but in no single business function had more than 10% of respondents actually scaled agents.

Experimentation is cheap; reliable, governed, measurable agents are rare. Evaluation is the discipline that gets you from the first to the second.

Agent evals vs. model evals: what actually changed?

A model eval asks a fixed question and grades a fixed answer. An agent eval has to grade a trajectory – the sequence of tool calls, retrievals, intermediate states, and side effects that produced the outcome. The answer can be right for the wrong reasons, or wrong because a tool fed the model stale data it had no reason to distrust.

Dimension	Model evaluation (2023 era)	Agent evaluation (2026)
Unit of analysis	A single prompt → response	A full trajectory of tool calls and state changes
What can fail	The model's reasoning	Reasoning plus retrieval, tool returns, permissions, provenance
Success metric	Accuracy on a test set	Task completion and consistency across repeated runs
Boundary	Self-contained	Spans tools the agent doesn't own
Evidence needed	The output	A replayable log: request → tool → response → outcome

The part teams consistently underestimate is the reliability math. Because the steps depend on each other, per-step accuracy multiplies. An agent that's 95% reliable on each step succeeds end-to-end only about 36% of the time across a 20-step chain. Read that twice.
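To see how fast that compounds, here is a minimal sketch of the arithmetic. It assumes every step succeeds or fails independently of the others, which real chains only approximate, and the function name is mine, not from any library.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# How a 95%-reliable step compounds as the chain gets longer.
for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps: {end_to_end_success(0.95, steps):.1%}")
# Prints 95.0%, 77.4%, 59.9% and 35.8% for 1, 5, 10 and 20 steps.
```

Under the same assumption, holding a 20-step chain at 90% end-to-end would need each step to be about 99.5% reliable.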

Figure: architecture diagram of an AI agent acting as a client that orchestrates external tools it does not own, calling MCP servers, retrieval, and APIs across a trust boundary.

That compounding unreliability is exactly what the tool-use benchmarks now measure. On τ-bench, GPT-4o reached roughly 61% task success on retail tasks but only ~35% on airline tasks, and pass^8 – the share of retail tasks solved on all eight of eight repeated runs – fell to about 25%.
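If you want to track the same consistency number on your own agent, a sketch like the one below is enough to start. It uses the combinatorial estimator that, as I understand it, the τ-bench paper uses for pass^k: for a task with c successes in n trials, the chance that k trials drawn without replacement all succeeded, averaged over tasks. The task names and results are made up for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k runs drawn without replacement from `trials` all succeeded."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average pass^k over tasks; each task maps to its per-run outcomes."""
    scores = [pass_hat_k(sum(runs), len(runs), k) for runs in results.values()]
    return sum(scores) / len(scores)


# Hypothetical results: three tasks, eight runs each.
results = {
    "refund_order": [True] * 8,
    "change_address": [True] * 6 + [False] * 2,
    "cancel_subscription": [True] * 4 + [False] * 4,
}
for k in (1, 4, 8):
    print(f"pass^{k} = {mean_pass_hat_k(results, k):.3f}")
# Prints 0.750, 0.410 and 0.333: average accuracy looks fine, consistency does not.
```

At k = 1 this is plain average accuracy. The gap between k = 1 and k = 8 is the part a single happy-path run hides.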

The follow-up benchmark made Anton’s point for him. When agents moved from a "full control" setting to one where they had to coordinate with users and tools they didn’t own, task-success rates dropped by up to 25 points – even for top-tier models (Sierra, τ²-bench, June 2025).

How do you evaluate an agent you don't fully control?

You evaluate the boundary. The agent's own reasoning matters, but what actually decides production behaviour is whether each tool returned what the agent expected – and whether you can prove it afterward. Here's how I'd structure it.

Golden trajectories, not golden answers. Pick 5–10 critical paths and assert on the whole run – the tool calls made, the order, the retrieved context, the final state – not just the text. A correct answer reached through a wrong tool call is a latent incident wearing a green checkmark.
Tool-contract tests. Treat every MCP endpoint as an external dependency with a contract: expected schema, latency, failure modes. Inject stale data, malformed responses, and timeouts, and confirm the agent degrades safely instead of confidently passing the bad value downstream.
Replayable provenance. Log request → tool → response → outcome for every step, in a form you can replay deterministically. This is the difference between "the agent failed" and "the CRM server returned a two-day-old record at 14:03 and the agent trusted it." One of those you can fix.
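The first two checks above can be sketched in a few dozen lines. Everything here is an assumption made for illustration: the `ToolCall` and `Trajectory` shapes, the tool names, the CRM fields, and the one-hour freshness limit. The golden check uses strict sequence equality, which is the simplest rule; you may want to allow reordering of independent calls.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class Trajectory:
    calls: list[ToolCall] = field(default_factory=list)
    final_state: dict = field(default_factory=dict)


def check_golden(actual: Trajectory, golden: Trajectory) -> list[str]:
    """Compare a run to its golden trajectory; an empty list means it matched."""
    problems = []
    if actual.calls != golden.calls:  # same tools, same arguments, same order
        problems.append(f"tool calls differ: got {actual.calls}")
    if actual.final_state != golden.final_state:
        problems.append(f"final state differs: got {actual.final_state}")
    return problems


# Hypothetical contract for a CRM lookup tool: required fields, types, max age.
CRM_FIELDS = {"customer_id": str, "balance": float, "fetched_at": str}
CRM_MAX_AGE = timedelta(hours=1)


def check_crm_contract(response: dict, now: datetime) -> list[str]:
    """Validate one tool response; an empty list means the contract held."""
    problems = [
        f"{name}: expected {kind.__name__}"
        for name, kind in CRM_FIELDS.items()
        if not isinstance(response.get(name), kind)
    ]
    if problems:
        return problems
    try:
        age = now - datetime.fromisoformat(response["fetched_at"])
    except (ValueError, TypeError):
        return ["fetched_at: not a timezone-aware ISO 8601 timestamp"]
    if age > CRM_MAX_AGE:
        problems.append(f"stale: record is {age} old")
    return problems


golden = Trajectory(
    calls=[
        ToolCall("crm.lookup", {"customer_id": "c-42"}),
        ToolCall("billing.refund", {"customer_id": "c-42", "amount": 20.0}),
    ],
    final_state={"refund_issued": True},
)
# Same final state, but the agent skipped the lookup: right answer, wrong path.
skipped_lookup = Trajectory(calls=golden.calls[1:], final_state=golden.final_state)

now = datetime(2026, 1, 15, 14, 3, tzinfo=timezone.utc)
fresh = {"customer_id": "c-42", "balance": 10.0, "fetched_at": "2026-01-15T14:00:00+00:00"}
stale = {**fresh, "fetched_at": "2026-01-13T14:03:00+00:00"}  # two days old
malformed = {"customer_id": "c-42", "balance": "10.0"}  # wrong type, missing field

print(check_golden(skipped_lookup, golden))
for injected in (fresh, stale, malformed):
    print(check_crm_contract(injected, now))
```

In a real harness the stale and malformed responses would be injected into the agent's tool layer, and the assertion would be on what the agent does next: stop, retry, or escalate, rather than pass the value downstream.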

Grounding the agent in your own data is the highest-leverage reliability move here – retrieval that returns cited, current, permission-aware context shrinks the surface where a tool can mislead the model. That’s why I treat RAG grounding as an evaluation concern, not just an accuracy one.

It's also where the security boundary lives. More than 13,000 MCP servers launched on GitHub in 2025 alone, and the protocol spec itself doesn't enforce audit, sandboxing, or verification (Zenity, 2025). Provenance and screening are things you add. They are not free, and they are not optional.

What does the EU AI Act now require for agent evaluation?

This is the dimension most engineering guides skip entirely – and it just moved. Under the Digital Omnibus political agreement of 7 May 2026, obligations for standalone high-risk systems were deferred from 2 August 2026 to 2 December 2027.

The deadline moved; the substance didn’t. Article 13 still requires high-risk systems to be transparent enough that deployers can interpret outputs and use them correctly, and Articles 12 and 19 require record-keeping and automatically generated logs. In agent terms, that’s replayable provenance – by law, not by preference.

Obligation	Original date	Current date	Source
Standalone high-risk (Annex III)	2 Aug 2026	2 Dec 2027	Council of the EU, May 2026
High-risk embedded in products (Annex I)	2 Aug 2026	2 Aug 2028	Council of the EU, May 2026
Transparency for AI-generated content	2 Aug 2026	2 Dec 2026	Council of the EU, May 2026

The penalties are what keep this off the "later" pile: the most serious violations can reach €35 million or 7% of global annual turnover, and breaches of high-risk obligations up to €15 million or 3% (European Commission). The extra runway isn’t a reason to relax – it’s time to build the logging, classification, and evidence pipeline an auditor will eventually ask for. Tooling like the SphereIQ platform exists to produce that audit-ready evidence by default.
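The logging piece of that pipeline can start small. Below is a minimal sketch of a request → tool → response → outcome log that can be replayed deterministically. The field names are my own, and the hash chain is an optional design choice for tamper evidence. None of this is a statement of what Articles 12 or 19 technically mandate.

```python
import hashlib
import json


def _digest(entry: dict) -> str:
    """Stable SHA-256 over an entry's JSON form (keys sorted)."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def record_step(log: list, at: str, tool: str, request: dict,
                response: dict, outcome: str) -> None:
    """Append one request -> tool -> response -> outcome record to the log."""
    entry = {
        "step": len(log),
        "at": at,
        "tool": tool,
        "request": request,
        "response": response,
        "outcome": outcome,
        "prev_hash": log[-1]["hash"] if log else "",
    }
    entry["hash"] = _digest(entry)  # covers every field above, including prev_hash
    log.append(entry)


def chain_is_intact(log: list) -> bool:
    """Recompute each hash and check that every entry links to the one before."""
    prev_hash = ""
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


def replay_tools(log: list):
    """Build a stand-in tool caller that serves the recorded responses in order."""
    steps = iter(log)

    def call(tool: str, request: dict) -> dict:
        entry = next(steps)
        if (entry["tool"], entry["request"]) != (tool, request):
            raise AssertionError(f"replay diverged at step {entry['step']}")
        return entry["response"]

    return call


# Hypothetical incident: the CRM served a two-day-old record and the agent used it.
log = []
record_step(log, "2026-01-15T14:03:00+00:00", "crm.lookup",
            {"customer_id": "c-42"},
            {"balance": 10.0, "fetched_at": "2026-01-13T14:03:00+00:00"},
            "agent trusted the record")

call = replay_tools(log)
print(call("crm.lookup", {"customer_id": "c-42"}))  # the recorded response, no live call
print(chain_is_intact(log))
```

Running the agent against `replay_tools(log)` instead of the live tools reproduces the incident exactly, and a failed `chain_is_intact` check tells you the record was edited after the fact.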

What does agent evaluation cost – and what does failure cost?

Evaluation isn’t free, and pretending otherwise is exactly how you end up in that 40% cancellation statistic. A reasonable rule of thumb from people doing this well: budget 10–20% of agent development time for evaluation and monitoring – reading traces, tuning signals, investigating real failures, not just writing test cases.

The economics of grounding, though, are friendlier than they look – which is the whole reason evaluation pays for itself. Teams that swapped fine-tuning for retrieval report deployments at roughly one-tenth the cost and in weeks instead of months.

Approach	Typical cost	Time to production	Stays current?
Fine-tuning a custom model	$200K+ per project	Months	No – retrain on change
Enterprise RAG (retrieval)	~1/10th of fine-tuning	6–8 weeks	Yes – at query time

Source: Sphere's enterprise RAG solution

The cost that dwarfs both is the unmeasured incident: the refund the agent approved wrongly, the policy it hallucinated, the PII it leaked through an unscreened tool call. Evaluation is cheaper than the headline. For teams building this end-to-end, Sphere treats provenance and governance as part of the build, not a bolt-on.

A practical agent-evaluation checklist

Define the frame – decide whether you’re raising the floor (reliability where it matters) or chasing a benchmark. Most production agents need the former.
Write 5–10 golden trajectories covering your critical paths, and assert on tool calls and final state, not just text.
Contract-test every tool / MCP endpoint for schema, latency, and failure handling.
Log replayable provenance for every request → tool → response → outcome.
Measure consistency, not just accuracy – track success across repeated runs (pass^k), not one happy path.
Screen the boundary for PII, prompt injection, and policy violations before inputs reach the model.
Map to the regulation – classify the system, retain logs, keep exportable evidence for EU AI Act obligations.
Budget 10–20% of build time for ongoing evaluation and monitoring.

FAQ

How is evaluating an AI agent different from evaluating an LLM?

Evaluating an LLM grades a single response to a fixed prompt. Evaluating an agent grades a full trajectory – the tool calls, retrievals, intermediate states, and side effects – because in 2026 the agent is a client orchestrating external tools it doesn't own.

Why do agentic AI projects fail?

Per Gartner (June 2025), the main causes are escalating costs, unclear business value, and inadequate risk controls – not model quality. MIT and McKinsey data show the same wall: experimentation is easy; reliable, governed, measurable agents are rare.

What is provenance in agent evaluation?

Provenance is a replayable record of what happened at each step: which tool was called, what it returned, when, and what the agent did with it. It turns "the agent failed" into a diagnosable, auditable event – and the EU AI Act now requires it for high-risk systems.

Does the EU AI Act apply to AI agents?

Yes, where an agent is part of a high-risk system. After the May 2026 Digital Omnibus, standalone high-risk obligations (including Article 13 transparency) apply from 2 December 2027, with penalties reaching up to €35M or 7% of global turnover for the most serious violations.

How much should you budget for agent evaluation?

A practical benchmark is 10–20% of agent development time spent on evaluation and monitoring – reading traces, tuning signals, and investigating real failures, not only writing eval cases.

Does RAG grounding make agents more reliable?

Yes. Grounding an agent in cited, current, permission-aware data narrows the surface where a tool can mislead the model – and it costs roughly one-tenth of fine-tuning while staying current as data changes.
