Your agent scores 78%.
That number is lying to you.
Vugg automatically discovers the hidden slices where your agent fails 94% of the time — confirms each pattern is statistically real — and isolates the exact minimal condition that triggers it. No predefined categories. No manual trace reading.
τ-bench: GPT-4o scores 60% pass@1, but consistency collapses to 25%. Vugg finds why.
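Why the collapse: τ-bench scores consistency with a pass^k-style metric, where a task only counts if the agent succeeds on all k independent trials. A minimal sketch, assuming the standard unbiased C(c, k)/C(n, k) estimator; the task counts below are invented for illustration, not benchmark data:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(all k trials succeed) for one task,
    given c successes observed in n trials (n >= k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Illustrative: two tasks the agent always solves, three it solves
# only half the time. pass@1 looks decent; pass^4 collapses.
trials = [(8, 8), (8, 8), (8, 4), (8, 4), (8, 4)]  # (n, c) per task

pass_at_1 = sum(c / n for n, c in trials) / len(trials)
pass_hat_4 = sum(pass_hat_k(n, c, 4) for n, c in trials) / len(trials)

print(f"pass@1 = {pass_at_1:.2f}")   # 0.70
print(f"pass^4 = {pass_hat_4:.2f}")  # 0.41
```

An agent that is merely inconsistent on a third of tasks loses nearly half its headline score, and the headline number tells you nothing about which tasks those are.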
Embed failed runs. Induce failure patterns through LLM chain-of-thought. No clustering. No predefined categories.
Generate targeted test cases for each pattern. Run your agent. Statistical proof the pattern is real, not noise.
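Vugg's internal test isn't documented here; one simple way to show a pattern is real rather than noise is an exact one-sided binomial test: under the null hypothesis, targeted cases fail at the baseline rate, so a far higher observed failure count yields a tiny p-value. A stdlib-only sketch, with counts chosen to mirror the example report below:

```python
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """Exact upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Null: targeted test cases fail at the 23% baseline rate.
# Observed: 29 failures in 33 targeted runs (illustrative counts).
baseline, failures, n = 0.23, 29, 33
p_value = binom_sf(failures, n, baseline)
print(f"p = {p_value:.2g}")  # far below 0.001: the pattern is not noise
```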
Contrastive testing isolates the exact minimal trigger. Not “fails on Django tasks” — “fails when Q() objects combine with .annotate() calls.”
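Contrastive testing can be sketched as flipping one factor at a time from a failing case toward a passing one and checking which single flip changes the outcome. Everything below is a hypothetical stand-in: `run_agent` is a stub wired to fail on mention order, and the factor names are invented for illustration:

```python
def run_agent(case: dict) -> bool:
    # Toy stand-in for the real agent: fails exactly when the restricted
    # action is mentioned first (the hypothetical trigger).
    return not case["restricted_first"]

def isolate_trigger(failing: dict, passing: dict) -> list[str]:
    """Flip each differing factor of the failing case toward the passing
    case; a factor whose flip alone fixes the run is a minimal trigger."""
    triggers = []
    for factor in failing:
        if failing[factor] == passing[factor]:
            continue
        probe = dict(failing)
        probe[factor] = passing[factor]
        if run_agent(probe):  # flipping just this factor fixed the run
            triggers.append(factor)
    return triggers

failing = {"restricted_first": True,  "topic": "refund", "length": "long"}
passing = {"restricted_first": False, "topic": "cancel", "length": "short"}
print(isolate_trigger(failing, passing))  # ['restricted_first']
```

The point of the probe loop: topic and length also differ between the two cases, but flipping either one leaves the run failing, so only the mention order survives as the minimal condition.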
What a Vugg report looks like
Pattern         Incomplete Task Finalization
Benchmark       τ-bench airline
Failure rate    87% (baseline: 23%, p < 0.001)
Trigger         Restricted action mentioned before unrestricted action in same request
Minimal cond.   Order of mention, not content
Evidence        29 failures across 4 models
Cross-model     ✓ (Opus, DeepSeek, GPT-4o, Sonnet)
A vug is a hidden cavity inside a rock that looks completely ordinary from the outside. You crack it open and find perfectly formed crystalline structure that was growing in there the whole time.
That's what Vugg does to your eval results.