Overview
We tested whether collecting an LLM's actual past failures, abstracting them into reusable patterns, and injecting them into prompts reduces error repetition.
Task: Korean document-grounded fact verification (avg. 1,424 words per document, 5 domains)
Core finding: Accuracy went from 81.2% to 90.0% (p=0.032), but the mechanism was not what we initially expected.
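To make "injecting failures into prompts" concrete, here is a minimal sketch of a prompt builder that prepends abstracted failure patterns to a verification prompt. The function, prompt wording, and label set are illustrative assumptions, not the exact format used in the experiment.

```python
def build_prompt(document: str, claim: str, failure_patterns: list[dict]) -> str:
    """Prepend abstracted past failures (Pattern / Signal / Lesson) to the task prompt.

    Hypothetical format: the experiment's actual prompt wording is not shown here.
    """
    lessons = "\n\n".join(
        f"Pattern: {p['pattern']}\nSignal: {p['signal']}\nLesson: {p['lesson']}"
        for p in failure_patterns
    )
    return (
        "You verify whether a claim is supported by the document.\n\n"
        + (f"Past failure patterns to watch for:\n{lessons}\n\n" if failure_patterns else "")
        + f"Document:\n{document}\n\n"
        + f"Claim: {claim}\n"
        + "Answer: Supported or Not Supported."
    )
```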
5-Arm Experiment Results
| Condition | Accuracy | NS F1 | vs Baseline | p-value |
|---|---|---|---|---|
| A - Baseline | 81.2% | 0.857 | — | — |
| B - Static toy examples | 82.5% | 0.870 | +1.2%p | 0.501 |
| D - Length control | 82.5% | 0.868 | +1.2%p | 0.512 |
| B' - Random real failures | 91.2% | 0.939 | +10.0%p | 0.010 |
| C - Retrieved failures | 90.0% | 0.931 | +8.8%p | 0.032 |
There was no significant difference between B' and C (p=0.749): retrieval added no measurable value over randomly sampled real failures.
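For context on how such pairwise p-values can be computed: every condition scores the same 80 items, so a paired test on per-item correctness is natural. The sketch below uses an exact McNemar test; this is one reasonable choice, not necessarily the test that produced the numbers above.

```python
from scipy.stats import binomtest

def mcnemar_exact(correct_a: list[bool], correct_b: list[bool]) -> float:
    """Exact McNemar test on paired per-item correctness for two conditions."""
    n01 = sum((not a) and b for a, b in zip(correct_a, correct_b))  # A wrong, B right
    n10 = sum(a and (not b) for a, b in zip(correct_a, correct_b))  # A right, B wrong
    n = n01 + n10
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    # Under H0 the discordant pairs split 50/50; two-sided exact binomial test.
    return binomtest(n01, n, p=0.5, alternative="two-sided").pvalue
```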
Effect Decomposition
| Factor | Delta | Significant? | Interpretation |
|---|---|---|---|
| Length/attention (A→D) | +1.2%p | No | Longer prompts don't help |
| Failure content (D→B') | +8.8%p | Yes | This is the driver |
| Retrieval (B'→C) | -1.2%p | No | No added value |
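The three factors are additive by construction; a quick sanity check with the rounded figures from the tables:

```python
# Additive decomposition of the overall A→C gain (rounded accuracies from the tables)
length_effect    = 82.5 - 81.2   # A→D
content_effect   = 91.2 - 82.5   # D→B'
retrieval_effect = 90.0 - 91.2   # B'→C
total = length_effect + content_effect + retrieval_effect  # ≈ +8.8%p, the A→C gap
```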
Per-Error-Type Improvement
| Error Type | Baseline Accuracy | With Failure Examples | Delta |
|---|---|---|---|
| Factual Mismatch | 100% | 100% | 0%p |
| Negation Flip | 62% | 100% | +38%p |
| Condition/Intensity | 62% | 77% | +15%p |
| Certainty/Status | 46% | 77% | +31%p |
| Ungrounded Reasoning | 100% | 100% | 0%p |
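For reference, a sketch of how per-type accuracy can be computed, assuming each test item is tagged with the error type it probes (the field names are hypothetical):

```python
from collections import defaultdict

def accuracy_by_error_type(results):
    """results: iterable of dicts like {"error_type": str, "correct": bool}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["error_type"]] += 1
        hits[r["error_type"]] += r["correct"]
    return {t: hits[t] / totals[t] for t in totals}
```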
3-Phase Pipeline
Phase 1: Run 80 items, collect 13 failures (16.2% error rate)
Phase 2: Abstract each failure into Pattern / Signal / Lesson
Phase 3: Test on 80 completely new items across 5 conditions
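A compressed sketch of how the three phases fit together. `llm`, `abstract_failure`, `make_prompt`, and the item fields are hypothetical stand-ins, not the actual implementation; condition B' is shown for Phase 3.

```python
import random

def run_pipeline(phase1_items, phase3_items, llm, abstract_failure, make_prompt):
    """Hypothetical end-to-end sketch of the 3-phase pipeline."""
    # Phase 1: run the plain prompt on the calibration items and keep the failures.
    failures = []
    for item in phase1_items:
        pred = llm(make_prompt(item, []))
        if pred != item["label"]:
            failures.append({"item": item, "prediction": pred})

    # Phase 2: abstract each concrete failure into a Pattern / Signal / Lesson triple.
    patterns = [abstract_failure(f) for f in failures]

    # Phase 3: evaluate on fresh items, here with 3 failure patterns drawn at random
    # per item (condition B').
    correct = 0
    for item in phase3_items:
        pred = llm(make_prompt(item, random.sample(patterns, k=3)))
        correct += pred == item["label"]
    return correct / len(phase3_items)
```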
Why Random Works
With 77% of failures concentrated in types 5 and 6, the probability that a random draw of 3 failure examples includes none from those dominant types is only about 1.2%.
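One way to reproduce that figure, assuming three independent draws at the observed 77% share:

```python
# P(all 3 sampled failures miss the dominant types), assuming independent draws
p_all_irrelevant = (1 - 0.77) ** 3   # ≈ 0.012, i.e. about 1.2%
```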