Agentic AI Eval Research — Phase 1, Part 2 · March 2026

Your AI agent is lying to you — the taxonomy and the fix

A systematic map of the 19 failure modes that existing frameworks can't catch — and the three evaluation primitives that would close the gap.

19 Failure Types · 4 Taxonomy Layers · 14.2% Step Attribution Accuracy (ICML 2025 baseline) · 3 Evaluation Primitives Proposed
In this article
  1. A 4-Layer Failure Taxonomy for Agentic AI Tool Calls
  2. Three Evaluation Primitives That Don't Exist Yet

Part 1 established the blind spot: existing eval frameworks were built for RAG, not tool execution, and the gap is both systematic and unaddressed. This part maps the 19 failure modes across four layers — and proposes three evaluation primitives to close it.

[Figure: the agent tool execution pipeline. User Intent → Agent Model → Select Tool (L1) → Build Args (L2) → Execute Tool Call → Parse Output (L3) → Chain Multi-Tool (L4) → Response. Input and final response are measured (✓); the five stages between them are not evaluated by any framework (✗).]
Fig. 1 — The full agent tool execution pipeline. Existing eval frameworks measure input quality and final output quality, but are blind to the five stages in between where 19 distinct failure modes occur.
KEY FINDING

Current eval frameworks evaluate what the model says. Almost nobody systematically evaluates what tools do — and whether the model correctly interpreted the result. This gap is confirmed by both academic research (Zhang et al., ICML 2025: 14.2% step-level attribution accuracy) and practitioner admissions (Arcade Evals, Feb 2026: explicitly excludes post-execution validation).


A 4-Layer Failure Taxonomy for Agentic AI Tool Calls

Most agentic failures aren't model failures — they're architecture failures concentrated at the tool execution layer. The taxonomy below organises 19 failure types across four layers, ordered by when in the execution chain they occur.

Layer 01 · Tool Selection · 4 failure types (wrong tool, omission, routing) · Detection: MED–HIGH · 1 silent failure mode
Layer 02 · Input Construction · 5 failure types (semantic errors, injection, schema mismatch) · Detection: HIGH · includes CVE-2025-68144
Layer 03 · Output Consumption · 5 failure types (hallucination, stale data, prompt injection) · Detection: HIGH — SILENT · 4 of 5 throw no error
Layer 04 · Chain & Multi-Tool · 5 failure types (error propagation, privilege pivot, state corruption) · Detection: VERY HIGH · emergent + 3 security CVEs
Total: 19 failure types — 0 covered by existing eval frameworks
Fig. 2 — All 19 failure types across the 4-layer taxonomy. Bar fill is proportional to failure count (max = 5). Detection difficulty reflects how reliably failures surface before reaching end users.


01 · Tool Selection: Did the agent even pick the right tool? (4 failures)
  1. False tool trigger: tool called when not needed, causing side effects. Detection: Medium
  2. Tool omission: no tool called; the model hallucinates the answer. Detection: HIGH — silent
  3. Wrong tool selection: query_data called instead of list_tables. Detection: Medium
  4. Ambiguous tool routing: wrong tool chosen due to vague docstrings. Detection: Medium-High
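All four Layer 1 failures are checkable from the trace alone: compare the tools the agent actually invoked against the tools the task genuinely requires. A minimal sketch of such a check — the function name and return labels are invented here, not taken from any framework:

```python
def check_tool_selection(expected_tools, actual_calls):
    """Classify a tool-selection outcome for one agent turn.

    expected_tools: set of tool names the task genuinely requires (may be empty).
    actual_calls:   list of tool names the agent invoked, in order.
    """
    actual = set(actual_calls)
    if not expected_tools and actual:
        return "false_trigger"   # tool called when not needed
    if expected_tools and not actual:
        return "tool_omission"   # answer will be hallucinated, silently
    if expected_tools and actual != expected_tools:
        return "wrong_tool"      # e.g. query_data instead of list_tables
    return "ok"


# The classic wrong-tool case from the table above:
check_tool_selection({"list_tables"}, ["query_data"])  # -> "wrong_tool"
```

Ambiguous routing (failure 4) needs richer signals than this — docstring similarity across tools, for instance — but the first three reduce to set comparisons over the trace.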
02 · Input Construction: Right tool, wrong arguments (5 failures)
  1. Syntactic argument error: invalid SQL; the tool returns an explicit error. Detection: LOW — explicit
  2. Semantic argument error: SQL runs but returns wrong data — the hardest problem. Detection: HIGH — silent
  3. Argument injection: user input interpreted as CLI flags (CVE-2025-68144). Detection: HIGH — security
  4. Schema mismatch: args generated for a different API version. Detection: Medium
  5. Over/under-scoped query: far too much or too little data retrieved. Detection: Medium
03 · Output Consumption: Tool ran fine. Model got it wrong. (5 failures)
  1. Hallucinated result completion: tool returns partial data; the model invents the rest. Detection: HIGH — silent
  2. Stale data trust: cached result presented as current fact. Detection: HIGH — silent
  3. Format misinterpretation: a JSON field parsed incorrectly. Detection: Medium
  4. Prompt injection via result: tool returns adversarial instructions in its content. Detection: HIGH — security
  5. Overconfident trust: uncertain result presented with false confidence. Detection: HIGH — silent
04 · Chain & Multi-Tool: Emergent failures across tool sequences (5 failures)
  1. Error propagation: Tool A's bad data is consumed silently by Tool B. Detection: HIGH — compounding
  2. Privilege pivot: auth token from Tool A used by Tool B unintentionally. Detection: HIGH — security
  3. Infinite retry loop: tool fails; the agent retries indefinitely. Detection: Medium — detectable
  4. State corruption: a write tool uses a stale read from earlier in the chain. Detection: HIGH — silent
  5. Toxic combinations: each tool is safe alone; combined they create an exploit (CVE-2025-68143/44/45). Detection: VERY HIGH — emergent
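Of the five, the infinite retry loop is the most mechanically preventable: bound the retry budget at the tool-wrapper level so a persistently failing tool surfaces an error up the chain instead of trapping the agent. A minimal sketch (names invented, not from any framework):

```python
import time

def call_with_retry_budget(tool_fn, args=(), max_attempts=3, backoff_s=0.0):
    """Bound retries so a failing tool cannot trap the agent in a loop."""
    last_err = None
    for attempt in range(max_attempts):
        try:
            return tool_fn(*args)
        except Exception as e:
            last_err = e
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    # Surface the failure to the chain instead of retrying forever.
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_err
```

The other four failures in this layer are emergent across steps, which is why they need attribution machinery (below) rather than a local guard.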

Detection difficulty reflects ease of catching failures before they reach end users. Silent = no error thrown.


Three Evaluation Primitives That Don't Exist Yet

The Viola Conseil methodology is built around three primitives that address the white space directly. Together, they constitute a framework that doesn't exist today as a unified methodology.

Tool Intent Alignment
Evaluates whether tool selection and argument construction correctly reflect the user's original intent — not just whether the tool was called, but whether the call means what the user asked.
ADDRESSES → LAYERS 1 & 2
Output Trust Calibration
Assesses whether tool outputs are complete, current, and safe to consume before the model builds a response on top of them. Stale data, partial results, and injected content all currently pass through without validation.
ADDRESSES → LAYER 3
Chain Failure Attribution
Automatically attributes multi-step failures to the specific layer and tool interaction responsible. Current research achieves only 14.2% step-level accuracy (ICML 2025). This primitive aims to make attribution systematic, not manual.
ADDRESSES → LAYER 4
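One way the three primitives could fit together is as verdicts attached to each step of a recorded trace, with attribution reduced to finding the earliest failing step. A sketch under that assumption — every class, field, and function name here is hypothetical, not a released implementation:

```python
from dataclasses import dataclass, field

@dataclass
class StepVerdict:
    primitive: str   # which primitive produced this verdict
    layer: int       # taxonomy layer (1-4) the finding belongs to
    passed: bool
    detail: str = ""

@dataclass
class ToolStepRecord:
    """One tool call in a trace: what was asked, called, and returned."""
    user_intent: str
    tool_name: str
    arguments: dict
    output: str
    verdicts: list = field(default_factory=list)

def attribute_chain_failure(trace):
    """Chain Failure Attribution sketch: return the index of the first
    step with a failing verdict — the earliest point the chain went
    wrong — or None if every step passed."""
    for i, step in enumerate(trace):
        if any(not v.passed for v in step.verdicts):
            return i
    return None
```

Tool Intent Alignment and Output Trust Calibration would populate the per-step verdicts; attribution then falls out of the trace structure instead of requiring manual replay.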
Get the Full Research

This is Phase 1 of an ongoing research program

The complete gap analysis, full taxonomy documentation, and methodology framework are available as a detailed research report. Phase 2 covers metrics design and sandbox instrumentation.

Request the Full Report · Work Together