Conscience Technology
Research

Nora Hallucination Detector: Frontier-level Hallucination Detection with a 9B Model

April 12, 2026

Overview

A lightweight model for detecting hallucinations in RAG pipeline outputs. Rather than simple binary classification, it uses a claim-by-claim decomposition approach inspired by FActScore.

It decomposes each answer into atomic claims and verifies each claim against the source context, labeling it Supported, Unsupported, or Contradicted.
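The claim-level pipeline can be sketched as follows. Note that `decompose` and `verify` here are toy stand-ins (sentence splitting and lexical overlap) for what the fine-tuned model actually does; in particular, the toy verifier cannot produce the Contradicted label, which requires real entailment checking.

```python
from typing import Dict, List

def decompose(answer: str) -> List[str]:
    """Toy stand-in: split an answer into (roughly) atomic claims by sentence.
    The real system uses the fine-tuned model for claim extraction."""
    return [s.strip() for s in answer.split(".") if s.strip()]

def verify(claim: str, context: str) -> str:
    """Toy stand-in: high token overlap with the context counts as support.
    The real verifier is the fine-tuned 9B model conditioned on the context."""
    claim_tokens = set(claim.lower().split())
    context_tokens = set(context.lower().split())
    overlap = len(claim_tokens & context_tokens) / max(len(claim_tokens), 1)
    return "Supported" if overlap >= 0.8 else "Unsupported"

def check_answer(answer: str, context: str) -> Dict[str, object]:
    """An answer is flagged as hallucinated if any claim lacks support."""
    verdicts = {claim: verify(claim, context) for claim in decompose(answer)}
    return {
        "claims": verdicts,
        "hallucinated": any(v != "Supported" for v in verdicts.values()),
    }
```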


Key Results

Benchmark Evaluation (500 samples, 10 benchmarks)

Method              Accuracy  Hallu-F1  Faith-F1
Qwen3.5-9B (base)      83.0%     0.757     0.869
Qwen3.5-9B + LoRA      81.6%     0.774     0.845
GPT-5.4                69.8%     0.691     0.705

The LoRA-tuned model achieved the highest Hallu-F1 (0.774), despite slightly lower overall accuracy than the base model.

Claude 4.6 Agreement

Method              Agreement   FP   FN
Qwen3.5-9B (base)       75.0%  113   12
Qwen3.5-9B + LoRA       89.6%   48    4
GPT-5.4                 91.0%   44    1

A 9B model trained in 22 minutes achieved 89.6% agreement with Claude 4.6, nearly matching GPT-5.4's 91.0%.
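Assuming the agreement evaluation uses the same 500-sample set as the benchmark evaluation above (an assumption; the article does not state the set size here), the FP/FN counts in the table reproduce the agreement rates exactly:

```python
def agreement_rate(fp: int, fn: int, n: int) -> float:
    """Agreement = fraction of samples where the detector's verdict matches
    Claude 4.6's; the disagreements are exactly the FPs and FNs."""
    return 1 - (fp + fn) / n

# Counts from the table, with an assumed 500-sample set:
for name, fp, fn in [("base", 113, 12), ("LoRA", 48, 4), ("GPT-5.4", 44, 1)]:
    print(f"{name}: {agreement_rate(fp, fn, 500):.1%}")
# base: 75.0%, LoRA: 89.6%, GPT-5.4: 91.0%
```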

Per-Benchmark Performance

Benchmark           Accuracy
RAGBench-HotpotQA      95.1%
FaithEval-CF           93.3%
RAGBench-MSMARCO       89.8%
HaluEval               87.5%
RAGBench-FinQA         87.0%
HaluBench              84.1%

Although trained only on RAGTruth data, the model generalizes to 80%+ accuracy on external benchmarks.


Architecture

Base Model: Qwen3.5-9B (Self-Attention + Mamba hybrid, 48 layers)

Fine-tuning: LoRA — Rank 16, Alpha 32, ~100M trainable params (1.1% of total), 510MB adapter

Training: 22 minutes on RTX 5090, 3 epochs, 980 balanced samples
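The fine-tuning setup can be sketched with Hugging Face PEFT. Rank and alpha come from the article; the dropout value, target modules, and checkpoint name are guesses for illustration (the exact module names for a hybrid Self-Attention + Mamba model would differ):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

lora_cfg = LoraConfig(
    r=16,            # rank 16, per the article
    lora_alpha=32,   # alpha 32, per the article
    lora_dropout=0.05,  # assumed; not stated in the article
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("Qwen3.5-9B")  # placeholder name
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # should report roughly 1% trainable
```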


Data Quality Matters

Version             Method                   Accuracy
v2 (label leakage)  Ground truth in prompts     61.1%
v3 (clean)          Label-free analysis         82.1%

Same model, same code — a 21-percentage-point difference from data quality alone.
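The v2 → v3 fix can be illustrated with hypothetical prompt templates (the actual templates are not shown in the article): in v2, the ground-truth label was present in the training prompt, so the model could parrot it instead of analyzing the text; v3 removes it.

```python
def build_prompt_v2(context: str, answer: str, label: str) -> str:
    # v2 (flawed): the ground-truth label leaks into the prompt,
    # letting the model copy it rather than judge the answer.
    return (f"Context: {context}\nAnswer: {answer}\n"
            f"Label: {label}\nIs the answer faithful to the context?")

def build_prompt_v3(context: str, answer: str) -> str:
    # v3 (clean): label-free -- the model must judge from the text alone.
    return (f"Context: {context}\nAnswer: {answer}\n"
            f"Is the answer faithful to the context?")
```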


Deployment

  • 4-bit quantization + 510MB adapter fits in 16GB of VRAM
  • GGUF Q4 can reduce this to 6-8GB
  • Inference: ~4.3s per sample (batch size 8)
  • Suited for async verification pipelines
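A rough back-of-envelope check on the memory figures (weights only; KV cache, activations, and runtime overhead come on top, which is why the deployed footprints above are larger than the raw weight size):

```python
def weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Raw weight storage only; excludes KV cache and runtime overhead."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_size_gb(9e9, 4))   # 4.5 GB of raw weights at 4-bit
print(weight_size_gb(9e9, 16))  # 18.0 GB at fp16, for comparison
```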