Your agent scores 78%.
That number is lying to you.

Vugg automatically discovers the hidden slices where your agent fails 94% of the time — confirms each pattern is statistically real — and isolates the exact minimal condition that triggers it. No predefined categories. No manual trace reading.

τ-bench: GPT-4o scores 60% pass@1 → collapses to 25% consistency at pass^8. Vugg finds why.

pass@1 · τ-bench airline: 78%
Looks fine. Crack it open.
τ-bench · SWE-bench · GAIA

18 patterns found · 3 benchmarks · 8 models


01
Find

Embed failed runs. Induce failure patterns through LLM chain-of-thought. No clustering. No predefined categories.
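The induction step can be sketched roughly like this — a minimal illustration, not Vugg's actual implementation. `complete(prompt)` stands in for any LLM call, and the prompt format and run representation are assumptions:

```python
# Hypothetical sketch: pack failed transcripts into one chain-of-thought
# prompt and ask the model to name recurring failure patterns itself,
# with no predefined taxonomy.

def build_induction_prompt(failed_runs):
    """failed_runs: list of (task, transcript) pairs."""
    header = (
        "You are analyzing failed agent runs. Think step by step, then list\n"
        "recurring failure patterns you observe. Invent names as needed;\n"
        "do not use any predefined category list.\n\n"
    )
    body = "\n\n".join(
        f"## Run {i}: {task}\n{transcript}"
        for i, (task, transcript) in enumerate(failed_runs, 1)
    )
    return header + body

def induce_patterns(failed_runs, complete):
    """complete(prompt) -> str is any LLM client; one pattern per line."""
    raw = complete(build_induction_prompt(failed_runs))
    return [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
```

The key design point is that pattern names come out of free-form reasoning over raw transcripts, not out of nearest-neighbor clustering.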

02
Confirm

Generate targeted test cases for each pattern. Run your agent. Statistical proof the pattern is real, not noise.
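One way to frame the statistical check (an assumed formulation, not necessarily Vugg's): if the pattern is real, targeted test cases should fail far above the agent's baseline failure rate, which an exact one-sided binomial test can confirm.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact one-sided tail probability."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def pattern_is_real(failures, trials, baseline_rate, alpha=0.01):
    """Reject 'this is noise' if failures on targeted cases are this extreme
    under the agent's baseline failure rate."""
    return binom_sf(failures, trials, baseline_rate) < alpha
```

For example, 19 failures out of 20 targeted cases against a 22% baseline is overwhelming evidence; 5 out of 20 is consistent with chance.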

03
Pinpoint

Contrastive testing isolates the exact minimal trigger. Not “fails on Django tasks” — “fails when Q() objects combine with .annotate() calls.”
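Contrastive testing reduces to an A/B comparison over matched task pairs — a sketch with a hypothetical `run_agent(task) -> bool` interface:

```python
# Hypothetical sketch: for each candidate condition, run matched tasks that
# differ only in that condition, and keep the condition with the largest
# failure-rate gap -- that's the minimal trigger.

def failure_rate(run_agent, tasks):
    """run_agent(task) -> True on pass, False on fail."""
    return sum(not run_agent(t) for t in tasks) / len(tasks)

def isolate_trigger(run_agent, candidates):
    """candidates: {condition_name: (tasks_with, tasks_without)}.
    Returns (condition, gap) with the biggest with-vs-without gap."""
    gaps = {
        name: failure_rate(run_agent, with_c) - failure_rate(run_agent, without_c)
        for name, (with_c, without_c) in candidates.items()
    }
    return max(gaps.items(), key=lambda kv: kv[1])
```

Broad conditions ("Django task") share their gap with every narrower condition they contain, so the narrowest condition with a full-size gap is the trigger.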

Canonical failure patterns discovered by Vugg · 774 agent failures · 3 benchmarks

| Pattern | Benchmark | Freq | Cross-model |
|---|---|---|---|
| Repeated or duplicate book_reservation calls when user confirms | τ-bench airline | 11 | 2 models |
| Calls book_reservation instead of update_reservation on existing booking | τ-bench airline | 4 | 2 models |
| Diff patch written as raw Python and executed directly as code | SWE-bench | 11 | 2 models |
| Repository path duplicated in tool call arguments causing file not found | SWE-bench | 9 | 2 models |
| Uses Python built-in shadowed by sandbox restriction (e.g. open, exec) | SWE-bench | 6 | 2 models |
| Returns zero search queries on niche factual identifier lookups | GAIA | 8 | 5 models |
| Solves constraint puzzle by guessing without issuing any search | GAIA | 8 | 5 models |
| Fails to extract DOI endnote detail from specific page of multi-page doc | GAIA | 5 | 4 models |
| Skips identifier lookup for concrete real-world objects (ISBN, DOI, ISSN) | GAIA | 6 | 5 models |
| Returns per-item sequence instead of computing single aggregate value | GAIA | 4 | 3 models |
| Guesses year despite explicit "according to source" constraint in question | GAIA | 4 | 4 models |

18 canonical patterns total · 13 merge events · 92% cross-model stable

Opus 4.1: "Incomplete Task Finalization"
DeepSeek V3: "Missing Finalization Step"
GPT-4o: "Incomplete Multi-step Task"
→ CANONICAL PATTERN: Incomplete Task Finalization · freq = 29 · 3 models · confidence: high

Three independent induction runs. Same failure. Three different names. Vugg merges them.
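The merge step can be pictured as greedy grouping by embedding similarity — the criterion and threshold here are assumptions for illustration, with `embed(name) -> vector` standing in for any embedding model:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def merge_patterns(patterns, embed, threshold=0.85):
    """patterns: list of (name, freq). Groups names whose embeddings clear the
    threshold against any existing group member; the most frequent name in a
    group becomes canonical. Returns {canonical_name: total_freq}."""
    groups = []  # each group: [names, vectors, freqs]
    for name, freq in patterns:
        v = embed(name)
        for g in groups:
            if any(cosine(v, u) >= threshold for u in g[1]):
                g[0].append(name); g[1].append(v); g[2].append(freq)
                break
        else:
            groups.append([[name], [v], [freq]])
    return {
        g[0][max(range(len(g[2])), key=g[2].__getitem__)]: sum(g[2])
        for g in groups
    }
```

So "Missing Finalization Step" folds into "Incomplete Task Finalization" when their descriptions embed close together, and frequencies add up across runs.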

A vug is a hidden cavity inside a rock that looks completely ordinary from the outside. You crack it open and find perfectly formed crystalline structure that was growing in there the whole time.

That's what Vugg does to your eval results.