Your agent scores 78%.
That number is lying to you.

Vugg automatically discovers the hidden slices where your agent fails 94% of the time — confirms each pattern is statistically real — and isolates the exact minimal condition that triggers it. No predefined categories. No manual trace reading.

τ-bench: GPT-4o scores 60% pass@1 → collapses to 25% consistency at pass^8. Vugg finds why.

pass@1 · τ-bench airline: 78%
Looks fine. Crack it open.
τ-bench · SWE-bench · GAIA

18 patterns found · 3 benchmarks · 8 models


01
Find

Embed failed runs. Induce failure patterns through LLM chain-of-thought. No clustering. No predefined categories.
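The induction step can be sketched roughly like this — a minimal illustration, not Vugg's actual implementation. `complete(prompt)` stands in for any LLM call, and the prompt format and run representation are assumptions:

```python
# Hypothetical sketch: pack failed transcripts into one chain-of-thought
# prompt and ask the model to name recurring failure patterns itself,
# with no predefined taxonomy.

def build_induction_prompt(failed_runs):
    """failed_runs: list of (task, transcript) pairs."""
    header = (
        "You are analyzing failed agent runs. Think step by step, then list\n"
        "recurring failure patterns you observe. Invent names as needed;\n"
        "do not use any predefined category list.\n\n"
    )
    body = "\n\n".join(
        f"## Run {i}: {task}\n{transcript}"
        for i, (task, transcript) in enumerate(failed_runs, 1)
    )
    return header + body

def induce_patterns(failed_runs, complete):
    """complete(prompt) -> str is any LLM client; one pattern per line."""
    raw = complete(build_induction_prompt(failed_runs))
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
```

The key design point is that pattern names come out of free-form reasoning over raw transcripts, not out of nearest-neighbor clustering.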

02
Confirm

Generate targeted test cases for each pattern. Run your agent. Statistical proof the pattern is real, not noise.
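One way to frame the statistical check (an assumed formulation, not necessarily Vugg's): if the pattern is real, targeted test cases should fail far above the agent's baseline failure rate, which an exact one-sided binomial test can confirm.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact one-sided tail probability."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def pattern_is_real(failures, trials, baseline_rate, alpha=0.01):
    """Reject 'this is noise' if failures on targeted cases are this extreme
    under the agent's baseline failure rate."""
    return binom_sf(failures, trials, baseline_rate) < alpha
```

For example, 19 failures out of 20 targeted cases against a 22% baseline is overwhelming evidence; 5 out of 20 is consistent with chance.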

03
Pinpoint

Contrastive testing isolates the exact minimal trigger. Not “fails on Django tasks” — “fails when Q() objects combine with .annotate() calls.”
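Contrastive testing reduces to an A/B comparison over matched task pairs — a sketch with a hypothetical `run_agent(task) -> bool` interface:

```python
# Hypothetical sketch: for each candidate condition, run matched tasks that
# differ only in that condition, and keep the condition with the largest
# failure-rate gap -- that's the minimal trigger.

def failure_rate(run_agent, tasks):
    """run_agent(task) -> True on pass, False on fail."""
    return sum(not run_agent(t) for t in tasks) / len(tasks)

def isolate_trigger(run_agent, candidates):
    """candidates: {condition_name: (tasks_with, tasks_without)}.
    Returns (condition, gap) with the biggest with-vs-without gap."""
    gaps = {
        name: failure_rate(run_agent, with_c) - failure_rate(run_agent, without_c)
        for name, (with_c, without_c) in candidates.items()
    }
    return max(gaps.items(), key=lambda kv: kv[1])
```

Broad conditions ("Django task") share their gap with every narrower condition they contain, so the narrowest condition with a full-size gap is the trigger.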

Canonical failure patterns discovered by Vugg · 774 agent failures · 3 benchmarks

| Pattern | Benchmark | Freq | Cross-model |
|---|---|---|---|
| Repeated or duplicate book_reservation calls when user confirms | τ-bench airline | 11 | 2 models |
| Calls book_reservation instead of update_reservation on existing booking | τ-bench airline | 4 | 2 models |
| Diff patch written as raw Python and executed directly as code | SWE-bench | 11 | 2 models |
| Repository path duplicated in tool call arguments causing file not found | SWE-bench | 9 | 2 models |
| Uses Python built-in shadowed by sandbox restriction (e.g. open, exec) | SWE-bench | 6 | 2 models |
| Returns zero search queries on niche factual identifier lookups | GAIA | 8 | 5 models |
| Solves constraint puzzle by guessing without issuing any search | GAIA | 8 | 5 models |
| Fails to extract DOI endnote detail from specific page of multi-page doc | GAIA | 5 | 4 models |
| Skips identifier lookup for concrete real-world objects (ISBN, DOI, ISSN) | GAIA | 6 | 5 models |
| Returns per-item sequence instead of computing single aggregate value | GAIA | 4 | 3 models |
| Guesses year despite explicit "according to source" constraint in question | GAIA | 4 | 4 models |

18 canonical patterns total · 13 merge events · 92% cross-model stable

Opus 4.1: "Incomplete Task Finalization"
DeepSeek V3: "Missing Finalization Step"
GPT-4o: "Incomplete Multi-step Task"
→ CANONICAL PATTERN: Incomplete Task Finalization · freq = 29 · 3 models · confidence: high

Three independent induction runs. Same failure. Three different names. Vugg merges them.
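The merge step can be pictured as greedy grouping by embedding similarity — the criterion and threshold here are assumptions for illustration, with `embed(name) -> vector` standing in for any embedding model:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def merge_patterns(patterns, embed, threshold=0.85):
    """patterns: list of (name, freq). Groups names whose embeddings clear the
    threshold against any existing group member; the most frequent name in a
    group becomes canonical. Returns {canonical_name: total_freq}."""
    groups = []  # each group: [names, vectors, freqs]
    for name, freq in patterns:
        v = embed(name)
        for g in groups:
            if any(cosine(v, u) >= threshold for u in g[1]):
                g[0].append(name); g[1].append(v); g[2].append(freq)
                break
        else:
            groups.append([[name], [v], [freq]])
    return {
        g[0][max(range(len(g[2])), key=g[2].__getitem__)]: sum(g[2])
        for g in groups
    }
```

So "Missing Finalization Step" folds into "Incomplete Task Finalization" when their descriptions embed close together, and frequencies add up across runs.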

A vug is a hidden cavity inside a rock that looks completely ordinary from the outside. You crack it open and find perfectly formed crystalline structure that was growing in there the whole time.

That's what Vugg does to your eval results.