ICML 2025 SPOTLIGHT · PENN STATE / DUKE
"Which Agent Causes Task Failures and When?"
Built the Who&When dataset across 127 multi-agent systems. Best tested attribution method achieves only 53.5% agent-level accuracy and 14.2% step-level accuracy. Authors conclude methods "fail to achieve practical usability." This is our Layer 4 baseline to beat.
LAYER 4 · CHAIN ATTRIBUTION
MICROSOFT AI RED TEAM · WHITEPAPER 2025
Taxonomy of Failure Modes in Agentic AI Systems
Most direct overlap — covers memory poisoning, XPIA, and multi-agent failures. Key difference: it's a security taxonomy, not an evaluation methodology. Tells you what can go wrong. Doesn't tell you how to measure it.
SECURITY FRAMING · ALL LAYERS
arXiv · FEBRUARY 2026
"MCP Tool Descriptions Are Smelly!"
Studied how tool description quality affects agent performance using MCP-Universe: 231 real-world tasks, 202 tools. Validates our Layer 1 (Ambiguous Tool Routing) from the docstring angle. They fix symptoms; this methodology evaluates them.
LAYER 1 · TOOL SELECTION
arXiv · SEPTEMBER 2025
Diagnosing Failure Root Causes in Agentic Platforms
Explicitly states existing methods "mainly focus on locating the step where failure occurs but fall short of diagnosing the failure root cause." Direct academic confirmation of the Layer 4 gap in a single sentence.
LAYER 4 · ROOT CAUSE GAP
NDSS 2026 · INJECAGENT / TOOLHIJACKER
Indirect Prompt Injection Benchmarks
Multiple papers benchmarking adversarial instructions embedded in tool results across 30+ agents and 17 tool types. Security community converging on our Layer 3 problem from the attack side. No eval methodology exists for the defense side.
LAYER 3 · INJECTION SECURITY
NIST · FEBRUARY 2026
AI Agent Standards Initiative
Launched by CAISI with three pillars: industry-led standards, open-source protocols, and agent security research. Regulatory pressure is building. The window to define methodology before standards are imposed is narrow.
STANDARDS · REGULATORY SIGNAL