Agentic AI Eval Research — Phase 1, Part 1 · March 2026

The Blind Spot in Every AI Eval Framework

RAGAS, DeepEval, LangSmith, TruLens — mature frameworks, genuinely useful. But they were built for RAG. This is what they systematically miss, why it matters at 97M+ monthly MCP downloads, and what the research confirms.

97M+ Monthly MCP
SDK Downloads
40% Enterprise AI with
Agents by 2026
14.2% Step Attribution
Accuracy (ICML 2025)
0 Frameworks
Solving This
Part of series: Part 1 — The Blind Spot Part 2 — Taxonomy & Solution
scroll
In this article
  1. The Eval Framework Blind Spot
  2. What the Research Confirms

Production agentic AI systems fail differently than chat-based AI. The major eval frameworks were architected for retrieval-augmented generation — they measure what the model says, not what tools do. This article establishes that gap: why it exists, how wide it is, and what the research landscape confirms. Part 2 maps the 19 failure modes in detail and proposes three evaluation primitives to close it.


The Eval Framework Blind Spot

RAGAS, DeepEval, LangSmith, TruLens — mature frameworks, genuinely useful. But they were built for RAG. The architecture assumptions don't map to tool execution, and the gaps are both systematic and undiscussed.

97M+
Monthly MCP SDK downloads as of early 2026. This is now the default standard for AI-tool integration.
Python + TypeScript SDKs combined
40%
Of enterprise applications will include agentic AI by end of 2026 — up from less than 5% in 2025.
Gartner, Aug 2025
14.2%
Step-level attribution accuracy achieved by the best tested method for identifying where in a chain a failure occurred.
Zhang et al., ICML 2025 Spotlight
✓ WHAT FRAMEWORKS COVER Model output quality Faithfulness, relevance, coherence RAG pipeline evaluation Context precision, retrieval accuracy Agent goal completion Did the task finish? End-to-end pass/fail Observability & tracing Spans, latency, token usage ✗ WHAT NO FRAMEWORK COVERS Semantic arg correctness SQL runs but returns wrong data — L2 Tool output trust validation Stale, injected, or partial results — L3 Chain-level failure attribution Which step caused failure? — L4 (14.2%) MCP protocol-level failures Tool selection, routing, injection via MCP
Fig. 1 — Coverage comparison. Every major eval framework covers the left column. None cover the right column — which maps directly to Layers 2, 3, and 4 of the taxonomy in Part 2.
PRACTITIONER ADMISSION

Arcade Evals (Feb 2026), an active MCP evaluation project, explicitly states their library is "intentionally scoped to tool selection and argument quality, without executing the tool — it doesn't validate what happens after the tool runs." This is Layer 3. Nobody is building it.


What the Research Confirms

This work doesn't exist in isolation. A substantial body of emerging research validates the gap from multiple angles — and confirms it remains unsolved.

ICML 2025 SPOTLIGHT · PENN STATE / DUKE
"Which Agent Causes Task Failures and When?"
Built the Who&When dataset across 127 multi-agent systems. Best tested attribution method achieves only 53.5% agent-level accuracy and 14.2% step-level accuracy. Authors conclude methods "fail to achieve practical usability." This is our Layer 4 baseline to beat.
LAYER 4 · CHAIN ATTRIBUTION
MICROSOFT AI RED TEAM · WHITEPAPER 2025
Taxonomy of Failure Modes in Agentic AI Systems
Most direct overlap — covers memory poisoning, XPIA, and multi-agent failures. Key difference: it's a security taxonomy, not an evaluation methodology. Tells you what can go wrong. Doesn't tell you how to measure it.
SECURITY FRAMING · ALL LAYERS
arXiv · FEBRUARY 2026
"MCP Tool Descriptions Are Smelly!"
Studied how tool description quality affects agent performance using MCP-Universe: 231 real-world tasks, 202 tools. Validates our Layer 1 (Ambiguous Tool Routing) from the docstring angle. They fix symptoms; this methodology evaluates them.
LAYER 1 · TOOL SELECTION
arXiv · SEPTEMBER 2025
Diagnosing Failure Root Causes in Agentic Platforms
Explicitly states existing methods "mainly focus on locating the step where failure occurs but fall short of diagnosing the failure root cause." Direct academic confirmation of the Layer 4 gap in a single sentence.
LAYER 4 · ROOT CAUSE GAP
NDSS 2026 · INJECAGENT / TOOLHIJACKER
Indirect Prompt Injection Benchmarks
Multiple papers benchmarking adversarial instructions embedded in tool results across 30+ agents and 17 tool types. Security community converging on our Layer 3 problem from the attack side. No eval methodology exists for the defense side.
LAYER 3 · INJECTION SECURITY
NIST · FEBRUARY 2026
AI Agent Standards Initiative
Launched by CAISI with three pillars: industry-led standards, open-source protocols, and agent security research. Regulatory pressure is building. The window to define methodology before standards are imposed is narrow.
STANDARDS · REGULATORY SIGNAL
Continue Reading

Part 2: The Taxonomy & The Solution

Now that you understand the blind spot, Part 2 maps all 19 failure modes across 4 layers — and proposes three evaluation primitives to close the gap. The full methodology framework is also available as a detailed research report.

Read Part 2 → Work Together