Your agent scores 78%.
That number is lying to you.
Vugg automatically discovers the hidden slices where your agent fails 94% of the time — confirms each pattern is statistically real — and isolates the exact minimal condition that triggers it. No predefined categories. No manual trace reading.
τ-bench: GPT-4o scores 60% pass@1, but consistency collapses to 25%. Vugg finds why.
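Why the collapse: τ-bench scores consistency with a pass^k-style metric, where a task only counts if the agent succeeds on all k independent trials. A minimal sketch, assuming the standard unbiased C(c, k)/C(n, k) estimator; the task counts below are invented for illustration, not benchmark data:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(all k trials succeed) for one task,
    given c successes observed in n trials (n >= k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Illustrative: two tasks the agent always solves, three it solves
# only half the time. pass@1 looks decent; pass^4 collapses.
trials = [(8, 8), (8, 8), (8, 4), (8, 4), (8, 4)]  # (n, c) per task

pass_at_1 = sum(c / n for n, c in trials) / len(trials)
pass_hat_4 = sum(pass_hat_k(n, c, 4) for n, c in trials) / len(trials)

print(f"pass@1 = {pass_at_1:.2f}")   # 0.70
print(f"pass^4 = {pass_hat_4:.2f}")  # 0.41
```

An agent that is merely inconsistent on a third of tasks loses nearly half its headline score, and the headline number tells you nothing about which tasks those are.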
Embed failed runs. Induce failure patterns through LLM chain-of-thought. No clustering. No predefined categories.
Generate targeted test cases for each pattern. Run your agent. Statistical proof the pattern is real, not noise.
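Vugg's internal test isn't documented here; one simple way to show a pattern is real rather than noise is an exact one-sided binomial test: under the null hypothesis, targeted cases fail at the baseline rate, so a far higher observed failure count yields a tiny p-value. A stdlib-only sketch, with counts chosen to mirror the example report below:

```python
from math import comb

def binom_sf(k: int, n: int, p: float) -> float:
    """Exact upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Null: targeted test cases fail at the 23% baseline rate.
# Observed: 29 failures in 33 targeted runs (illustrative counts).
baseline, failures, n = 0.23, 29, 33
p_value = binom_sf(failures, n, baseline)
print(f"p = {p_value:.2g}")  # far below 0.001: the pattern is not noise
```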
Contrastive testing isolates the exact minimal trigger. Not “fails on Django tasks” — “fails when Q() objects combine with .annotate() calls.”
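Contrastive testing can be sketched as flipping one factor at a time from a failing case toward a passing one and checking which single flip changes the outcome. Everything below is a hypothetical stand-in: `run_agent` is a stub wired to fail on mention order, and the factor names are invented for illustration:

```python
def run_agent(case: dict) -> bool:
    # Toy stand-in for the real agent: fails exactly when the restricted
    # action is mentioned first (the hypothetical trigger).
    return not case["restricted_first"]

def isolate_trigger(failing: dict, passing: dict) -> list[str]:
    """Flip each differing factor of the failing case toward the passing
    case; a factor whose flip alone fixes the run is a minimal trigger."""
    triggers = []
    for factor in failing:
        if failing[factor] == passing[factor]:
            continue
        probe = dict(failing)
        probe[factor] = passing[factor]
        if run_agent(probe):  # flipping just this factor fixed the run
            triggers.append(factor)
    return triggers

failing = {"restricted_first": True,  "topic": "refund", "length": "long"}
passing = {"restricted_first": False, "topic": "cancel", "length": "short"}
print(isolate_trigger(failing, passing))  # ['restricted_first']
```

The point of the probe loop: topic and length also differ between the two cases, but flipping either one leaves the run failing, so only the mention order survives as the minimal condition.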
What a Vugg report looks like
Pattern         Incomplete Task Finalization
Benchmark       τ-bench airline
Failure rate    87% (baseline: 23%, p < 0.001)
Trigger         Restricted action mentioned before unrestricted action in same request
Minimal cond.   Order of mention, not content
Evidence        29 failures across 4 models
Cross-model     ✓ (Opus, DeepSeek, GPT-4o, Sonnet)
A vug is a hidden cavity inside a rock that looks completely ordinary from the outside. You crack it open and find perfectly formed crystalline structure that was growing in there the whole time.
That's what Vugg does to your eval results.