## Overview
A lightweight model for detecting hallucinations in RAG pipeline outputs. Rather than performing simple binary classification, it uses a claim-by-claim decomposition approach inspired by FActScore: the answer is decomposed into individual atomic claims, and each claim is verified against the source context as Supported, Unsupported, or Contradicted.
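The decompose-then-verify flow can be sketched as follows. This is a minimal illustration, not the model's actual implementation: the naive sentence split and substring check stand in for the two model calls (claim extraction and per-claim verification), and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal

Verdict = Literal["Supported", "Unsupported", "Contradicted"]

@dataclass
class ClaimResult:
    claim: str
    verdict: Verdict

def decompose(answer: str) -> list[str]:
    # Stand-in for model-based claim extraction: naive sentence split.
    return [s.strip() for s in answer.split(".") if s.strip()]

def verify_claim(claim: str, context: str) -> Verdict:
    # Stand-in for the model's verification call: a trivial substring check.
    return "Supported" if claim.lower() in context.lower() else "Unsupported"

def check_answer(answer: str, context: str) -> list[ClaimResult]:
    # An answer is flagged as hallucinated if any claim fails verification.
    return [ClaimResult(c, verify_claim(c, context)) for c in decompose(answer)]
```

The per-claim verdicts can then be aggregated however the pipeline requires, e.g. flagging the whole answer if any claim is Unsupported or Contradicted.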
## Key Results
### Benchmark Evaluation (500 samples, 10 benchmarks)
| Method | Accuracy | Hallu-F1 | Faith-F1 |
|---|---|---|---|
| Qwen3.5-9B (base) | 83.0% | 0.757 | 0.869 |
| Qwen3.5-9B + LoRA | 81.6% | 0.774 | 0.845 |
| GPT-5.4 | 69.8% | 0.691 | 0.705 |
The LoRA model achieved the highest Hallu-F1 (0.774).
### Claude 4.6 Agreement
| Method | Agreement | False Positives | False Negatives |
|---|---|---|---|
| Qwen3.5-9B (base) | 75.0% | 113 | 12 |
| Qwen3.5-9B + LoRA | 89.6% | 48 | 4 |
| GPT-5.4 | 91.0% | 4 | 41 |
A 9B model trained in 22 minutes achieved 89.6% agreement with Claude 4.6, nearly matching GPT-5.4's 91.0%.
### Per-Benchmark Performance
| Benchmark | Accuracy |
|---|---|
| RAGBench-HotpotQA | 95.1% |
| FaithEval-CF | 93.3% |
| RAGBench-MSMARCO | 89.8% |
| HaluEval | 87.5% |
| RAGBench-FinQA | 87.0% |
| HaluBench | 84.1% |
Although trained only on RAGTruth data, the model generalizes to 80%+ accuracy on external benchmarks.
## Architecture
- Base model: Qwen3.5-9B (Self-Attention + Mamba hybrid, 48 layers)
- Fine-tuning: LoRA (rank 16, alpha 32), ~100M trainable params (1.1% of total), 510MB adapter
- Training: 22 minutes on an RTX 5090, 3 epochs, 980 balanced samples
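The trainable-parameter count follows directly from the LoRA rank. A rank-r adapter on a weight matrix of shape (d_out, d_in) adds two low-rank factors, A (r, d_in) and B (d_out, r). The dimensions below are illustrative assumptions, not the actual Qwen3.5-9B layer shapes:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA replaces the frozen update dW with B @ A, so the trainable
    # parameters per adapted matrix are r*d_in (for A) plus d_out*r (for B).
    return rank * (d_in + d_out)

# Example with an assumed 4096x4096 projection at rank 16:
# 16 * (4096 + 4096) = 131,072 trainable params per adapted matrix.
per_projection = lora_params(4096, 4096, 16)
```

Summing this over the adapted projections in all 48 layers yields the ~100M trainable parameters (about 1.1% of the 9B total) reported above.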
## Data Quality Matters
| Version | Method | Accuracy |
|---|---|---|
| v2 (label leakage) | Ground truth in prompts | 61.1% |
| v3 (clean) | Label-free analysis | 82.1% |
Same model, same code: a 21-percentage-point difference from data quality alone.
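The v2 failure mode can be made concrete with a sketch of prompt construction. This is a hypothetical reconstruction of the bug, not the project's actual prompt template: the point is simply that the ground-truth label must never appear anywhere in the prompt the model sees.

```python
from typing import Optional

def build_prompt(context: str, answer: str, label: Optional[str] = None) -> str:
    prompt = (
        "Context:\n" + context + "\n\n"
        "Answer:\n" + answer + "\n\n"
        "Verify each claim in the answer against the context."
    )
    # v2-style leakage (do NOT do this): appending the ground-truth label
    # teaches the model to copy it instead of analyzing the answer.
    if label is not None:
        prompt += f"\n\nGround truth: {label}"
    return prompt
```

The clean v3 pipeline corresponds to always calling `build_prompt(context, answer)` with no label argument, for both training and evaluation.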
## Deployment
- 4-bit quantization plus the 510MB adapter fits in 16GB VRAM
- GGUF Q4 can reduce this to 6-8GB
- Inference: ~4.3s/sample (batch 8)
- Suited for async verification pipelines
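At ~4.3s per sample, verification is too slow to sit on the response path, but it drops naturally into a background pipeline. A minimal asyncio sketch, with `verify` as a hypothetical stand-in for the batched model call:

```python
import asyncio

async def verify(sample: str) -> str:
    # Stand-in for the real model inference (~4.3s/sample at batch size 8).
    await asyncio.sleep(0)
    return "Supported"

async def verify_all(samples: list[str], batch_size: int = 8) -> list[str]:
    # Process samples in fixed-size batches, running each batch concurrently.
    results: list[str] = []
    for i in range(0, len(samples), batch_size):
        batch = samples[i:i + batch_size]
        results += await asyncio.gather(*(verify(s) for s in batch))
    return results
```

In a real deployment the verdicts would be written back asynchronously (e.g. to flag answers for review) rather than blocking the user-facing response.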