Your agent scores 78%.
That number is lying to you.
Vugg automatically discovers the hidden slices where your agent fails 94% of the time, confirms each pattern is statistically real, and isolates the exact minimal condition that triggers it. No predefined categories. No manual trace reading.
τ-bench: GPT-4o scores 60% pass@1, but collapses to 25% consistency across repeated trials. Vugg finds why.
18 patterns found · 3 benchmarks · 8 models
Embed failed runs. Induce failure patterns through LLM chain-of-thought. No clustering. No predefined categories.
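A minimal sketch of this step, assuming a sentence-transformers embedder and a hypothetical `complete(prompt)` LLM helper (both stand-ins, not Vugg's actual internals):

```python
# Sketch only: `complete` is a hypothetical LLM call-out, and the embedding
# model is an arbitrary choice for illustration.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def induce_patterns(failed_traces: list[str], complete) -> list[str]:
    """Embed failed runs, then let an LLM reason out the shared failure modes."""
    vectors = embedder.encode(failed_traces)   # one vector per failed run
    # No k-means, no fixed taxonomy: embeddings only sort similar failures
    # next to each other so the LLM sees related traces side by side.
    scores = vectors @ vectors.mean(axis=0)
    ranked = [t for _, t in sorted(zip(scores, failed_traces), reverse=True)]
    prompt = (
        "Here are failed agent runs. Think step by step about what the "
        "failures share, then name each distinct failure pattern on its "
        "own line:\n\n" + "\n---\n".join(ranked[:20])
    )
    return [line for line in complete(prompt).splitlines() if line.strip()]
```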
Generate targeted test cases for each pattern. Run your agent. Statistical proof the pattern is real, not noise.
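The confirmation step amounts to a one-sided hypothesis test: does the agent fail the targeted cases significantly more often than its baseline rate? A sketch using a plain binomial test (the numbers are assumed; Vugg's actual statistics may differ):

```python
from scipy.stats import binomtest

def pattern_is_real(slice_failures: int, slice_runs: int,
                    baseline_fail_rate: float, alpha: float = 0.01) -> bool:
    """Reject the null hypothesis that the slice fails at the baseline rate.

    If the agent fails the targeted test cases significantly more often
    than it fails in general, the pattern is real, not sampling noise.
    """
    result = binomtest(slice_failures, slice_runs,
                       p=baseline_fail_rate, alternative="greater")
    return result.pvalue < alpha

# Example: agent fails 17/20 targeted cases against a 22% baseline
# failure rate.
print(pattern_is_real(17, 20, 0.22))  # True: far beyond chance
```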
Contrastive testing isolates the exact minimal trigger. Not “fails on Django tasks” — “fails when Q() objects combine with .annotate() calls.”
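Conceptually this is a one-factor-at-a-time ablation over a reproducible failure. A sketch with a hypothetical `run_agent(task) -> bool` (True on success); the condition names echo the Django example above:

```python
def minimal_trigger(task: dict, conditions: list[str], run_agent) -> list[str]:
    """Ablate one suspected condition at a time from a failing task.

    A condition belongs to the minimal trigger iff removing it flips the
    run from failure to success; conditions whose removal still fails
    are incidental context.
    """
    assert not run_agent(task), "start from a reproducible failure"
    trigger = []
    for cond in conditions:
        ablated = {**task, cond: None}  # strip a single condition
        if run_agent(ablated):          # failure vanished -> cond is causal
            trigger.append(cond)
    return trigger

# conditions = ["uses_Q_objects", "uses_annotate", "is_django_task"]
# -> ["uses_Q_objects", "uses_annotate"]: the Q() + .annotate()
#    combination, not "Django tasks" in general.
```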
Canonical failure patterns discovered by Vugg · 774 agent failures · 3 benchmarks
| Pattern | Benchmark | Freq | Cross-model |
|---|---|---|---|
| Repeated or duplicate book_reservation calls when user confirms | τ-bench airline | 11 | ✓ 2 models |
| Calls book_reservation instead of update_reservation on existing booking | τ-bench airline | 4 | ✓ 2 models |
| Diff patch written as raw Python and executed directly as code | SWE-bench | 11 | ✓ 2 models |
| Repository path duplicated in tool call arguments causing file not found | SWE-bench | 9 | ✓ 2 models |
| Uses Python built-in shadowed by sandbox restriction (e.g. open, exec) | SWE-bench | 6 | ✓ 2 models |
| Returns zero search queries on niche factual identifier lookups | GAIA | 8 | ✓ 5 models |
| Solves constraint puzzle by guessing without issuing any search | GAIA | 8 | ✓ 5 models |
| Fails to extract DOI endnote detail from specific page of multi-page doc | GAIA | 5 | ✓ 4 models |
| Skips identifier lookup for concrete real-world objects (ISBN, DOI, ISSN) | GAIA | 6 | ✓ 5 models |
| Returns per-item sequence instead of computing single aggregate value | GAIA | 4 | ✓ 3 models |
| Guesses year despite explicit "according to source" constraint in question | GAIA | 4 | ✓ 4 models |
18 canonical patterns total · 13 merge events · 92% cross-model stable
Three independent induction runs. Same failure. Three different names. Vugg merges them.
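One way such merging can work is thresholded embedding similarity over the pattern descriptions (a sketch; the model and the 0.85 cutoff are illustrative assumptions, not Vugg's published settings):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def merge_patterns(descriptions: list[str],
                   threshold: float = 0.85) -> list[list[str]]:
    """Greedily group pattern descriptions whose embeddings nearly coincide."""
    vecs = embedder.encode(descriptions, normalize_embeddings=True)
    groups: list[tuple[np.ndarray, list[str]]] = []
    for vec, desc in zip(vecs, descriptions):
        for seed, members in groups:
            if float(vec @ seed) >= threshold:  # cosine sim: unit vectors
                members.append(desc)
                break
        else:
            groups.append((vec, [desc]))
    return [members for _, members in groups]

# Three names for one failure, e.g. "double-books on confirmation",
# "duplicate book_reservation call", "re-books already-confirmed trip",
# collapse into a single canonical pattern.
```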
A vug is a hidden cavity inside a rock that looks completely ordinary from the outside. You crack it open and find a perfectly formed crystalline structure that was growing in there the whole time.
That's what Vugg does to your eval results.