Agentic AI Eval Research — Phase 1, Part 2 · March 2026

Your AI agent is lying to you — the taxonomy and the fix

A systematic map of the 19 failure modes that existing frameworks can't catch — and the three evaluation primitives that would close the gap.

19 Failure Types · 4 Taxonomy Layers · 14.2% Step Attribution Accuracy (ICML 2025 baseline) · 3 Evaluation Primitives Proposed
In this article
  1. A 4-Layer Failure Taxonomy for Agentic AI Tool Calls
  2. Three Evaluation Primitives That Don't Exist Yet

Part 1 established the blind spot: existing eval frameworks were built for RAG, not tool execution, and the gap is both systematic and unaddressed. This part maps the 19 failure modes across four layers — and proposes three evaluation primitives to close it.

[Figure: the agent tool execution pipeline. User Intent → Agent Model → Select Tool (L1) → Build Args (L2) → Execute Tool Call → Parse Output (L3) → Chain Multi-Tool (L4) → Response. Input and final response are measured (✓); the five stages between them are not evaluated by any framework (✗).]
Fig. 1 — The full agent tool execution pipeline. Existing eval frameworks measure input quality and final output quality, but are blind to the five stages in between where 19 distinct failure modes occur.
KEY FINDING

Current eval frameworks evaluate what the model says. Almost nobody systematically evaluates what tools do — and whether the model correctly interpreted the result. This gap is confirmed by both academic research (Zhang et al., ICML 2025: 14.2% step-level attribution accuracy) and practitioner admissions (Arcade Evals, Feb 2026: explicitly excludes post-execution validation).


A 4-Layer Failure Taxonomy for Agentic AI Tool Calls

Most agentic failures aren't model failures — they're architecture failures concentrated at the tool execution layer. The taxonomy below organises 19 failure types across four layers, ordered by when in the execution chain they occur.

Layer 01 · Tool Selection · 4 failure types (wrong tool, omission, routing) · Detection: MED–HIGH · 1 silent failure mode
Layer 02 · Input Construction · 5 failure types (semantic errors, injection, schema mismatch) · Detection: HIGH · includes CVE-2025-68144
Layer 03 · Output Consumption · 5 failure types (hallucination, stale data, prompt injection) · Detection: HIGH — SILENT · 4 of 5 throw no error
Layer 04 · Chain & Multi-Tool · 5 failure types (error propagation, privilege pivot, state corruption) · Detection: VERY HIGH · emergent + 3 security CVEs
Total: 19 failure types — 0 covered by existing eval frameworks
Fig. 2 — All 19 failure types across the 4-layer taxonomy. Bar fill is proportional to failure count (max = 5). Detection difficulty reflects how reliably failures surface before reaching end users.


01 · Tool Selection: Did the agent even pick the right tool? (4 failures)
  1. False tool trigger: tool called when not needed, causing side effects. Detection: Medium
  2. Tool omission: no tool called; the model hallucinates the answer. Detection: HIGH — silent
  3. Wrong tool selection: query_data called instead of list_tables. Detection: Medium
  4. Ambiguous tool routing: wrong tool chosen due to vague docstrings. Detection: Medium-High
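All four Layer 1 failures are checkable from the trace alone: compare the tools the agent actually invoked against the tools the task genuinely requires. A minimal sketch of such a check — the function name and return labels are invented here, not taken from any framework:

```python
def check_tool_selection(expected_tools, actual_calls):
    """Classify a tool-selection outcome for one agent turn.

    expected_tools: set of tool names the task genuinely requires (may be empty).
    actual_calls:   list of tool names the agent invoked, in order.
    """
    actual = set(actual_calls)
    if not expected_tools and actual:
        return "false_trigger"   # tool called when not needed
    if expected_tools and not actual:
        return "tool_omission"   # answer will be hallucinated, silently
    if expected_tools and actual != expected_tools:
        return "wrong_tool"      # e.g. query_data instead of list_tables
    return "ok"


# The classic wrong-tool case from the table above:
check_tool_selection({"list_tables"}, ["query_data"])  # -> "wrong_tool"
```

Ambiguous routing (failure 4) needs richer signals than this — docstring similarity across tools, for instance — but the first three reduce to set comparisons over the trace.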
02 · Input Construction: Right tool, wrong arguments (5 failures)
  1. Syntactic argument error: invalid SQL; the tool returns an explicit error. Detection: LOW — explicit
  2. Semantic argument error: SQL runs but returns wrong data — the hardest problem. Detection: HIGH — silent
  3. Argument injection: user input interpreted as CLI flags (CVE-2025-68144). Detection: HIGH — security
  4. Schema mismatch: args generated for a different API version. Detection: Medium
  5. Over/under-scoped query: far too much or too little data retrieved. Detection: Medium
03 · Output Consumption: Tool ran fine. Model got it wrong. (5 failures)
  1. Hallucinated result completion: tool returns partial data; the model invents the rest. Detection: HIGH — silent
  2. Stale data trust: cached result presented as current fact. Detection: HIGH — silent
  3. Format misinterpretation: a JSON field parsed incorrectly. Detection: Medium
  4. Prompt injection via result: tool returns adversarial instructions in its content. Detection: HIGH — security
  5. Overconfident trust: uncertain result presented with false confidence. Detection: HIGH — silent
04 · Chain & Multi-Tool: Emergent failures across tool sequences (5 failures)
  1. Error propagation: Tool A's bad data is consumed silently by Tool B. Detection: HIGH — compounding
  2. Privilege pivot: auth token from Tool A used by Tool B unintentionally. Detection: HIGH — security
  3. Infinite retry loop: tool fails; the agent retries indefinitely. Detection: Medium — detectable
  4. State corruption: a write tool uses a stale read from earlier in the chain. Detection: HIGH — silent
  5. Toxic combinations: each tool is safe alone; combined they create an exploit (CVE-2025-68143/44/45). Detection: VERY HIGH — emergent
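Of the five, the infinite retry loop is the most mechanically preventable: bound the retry budget at the tool-wrapper level so a persistently failing tool surfaces an error up the chain instead of trapping the agent. A minimal sketch (names invented, not from any framework):

```python
import time

def call_with_retry_budget(tool_fn, args=(), max_attempts=3, backoff_s=0.0):
    """Bound retries so a failing tool cannot trap the agent in a loop."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            return tool_fn(*args)
        except Exception as e:
            last_err = e
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Surface the failure to the chain instead of retrying forever.
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_err
```

The other four failures in this layer are emergent across steps, which is why they need attribution machinery (below) rather than a local guard.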

Detection difficulty reflects ease of catching failures before they reach end users. Silent = no error thrown.


Three Evaluation Primitives That Don't Exist Yet

The Viola Conseil methodology is built around three primitives that address the white space directly. Together, they constitute a framework that doesn't exist today as a unified methodology.

Tool Intent Alignment
Evaluates whether tool selection and argument construction correctly reflect the user's original intent — not just whether the tool was called, but whether the call means what the user asked.
ADDRESSES → LAYERS 1 & 2
Output Trust Calibration
Assesses whether tool outputs are complete, current, and safe to consume before the model builds a response on top of them. Stale data, partial results, and injected content all currently pass through without validation.
ADDRESSES → LAYER 3
Chain Failure Attribution
Automatically attributes multi-step failures to the specific layer and tool interaction responsible. Current research achieves only 14.2% step-level accuracy (ICML 2025). This primitive aims to make attribution systematic, not manual.
ADDRESSES → LAYER 4
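One way the three primitives could fit together is as verdicts attached to each step of a recorded trace, with attribution reduced to finding the earliest failing step. A sketch under that assumption — every class, field, and function name here is hypothetical, not a released implementation:

```python
from dataclasses import dataclass, field

@dataclass
class StepVerdict:
    primitive: str   # which primitive produced this verdict
    layer: int       # taxonomy layer (1-4) the finding belongs to
    passed: bool
    detail: str = ""

@dataclass
class ToolStepRecord:
    """One tool call in a trace: what was asked, called, and returned."""
    user_intent: str
    tool_name: str
    arguments: dict
    output: str
    verdicts: list = field(default_factory=list)

def attribute_chain_failure(trace):
    """Chain Failure Attribution sketch: return the index of the first
    step with a failing verdict — the earliest point the chain went
    wrong — or None if every step passed."""
    for i, step in enumerate(trace):
        if any(not v.passed for v in step.verdicts):
            return i
    return None
```

Tool Intent Alignment and Output Trust Calibration would populate the per-step verdicts; attribution then falls out of the trace structure instead of requiring manual replay.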
Get the Full Research

This is Phase 1 of an ongoing research program

The complete gap analysis, full taxonomy documentation, and methodology framework are available as a detailed research report. Phase 2 covers metrics design and sandbox instrumentation.

Request the Full Report · Work Together