
How TruthScan Achieves 99.8% Accuracy Across Audio, Video, Image & Text

A deep dive into our ensemble architecture — RawNet2, XceptionNet, RoBERTa, and SyncNet — and how we weight votes across modalities to achieve 99.8% detection accuracy.

March 28, 2026 · 12 min read · AIGeneratedIt Research

Most AI detectors fail because they rely on a single model. A single model has a single failure mode — and adversarial content generation tools are specifically designed to exploit those failure modes. TruthScan takes a fundamentally different approach: an ensemble of 11 specialist forensic models, each trained on a different signal, voting together on every scan.

The four pillars of TruthScan

TruthScan is built around four specialist sub-engines, one per media modality.

1. Text — RoBERTa + Binoculars + Fast-Detect-GPT

AI text detection exploits a fundamental property of language models: they generate text by predicting the most probable next token. This produces text with lower perplexity (more predictable word choices) and lower burstiness (less variation in sentence length and complexity) than human writing. Our text engine combines three approaches:

  • RoBERTa classifier — fine-tuned on 500,000 labeled examples from GPT-3.5, GPT-4, GPT-4o, Gemini, Claude, Llama, and Mistral
  • Binoculars — a zero-shot perplexity scoring method that requires no training data
  • Fast-Detect-GPT — perturbation-based detection that tests whether the text sits in a local maximum of the probability distribution

These three approaches are orthogonal: Binoculars catches text that RoBERTa misses (e.g. lightly paraphrased AI content), and Fast-Detect-GPT catches text that both miss by checking the underlying probability landscape directly.
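The burstiness signal described above can be made concrete with a few lines of code. This is an illustrative metric, not TruthScan's production formula: it scores a passage by the coefficient of variation of its sentence lengths, so uniform machine-like sentences score near zero while varied human prose scores higher.

```python
import re
from statistics import mean, pstdev

def burstiness(text: str) -> float:
    """Coefficient of variation of sentence lengths (in words).
    Higher values indicate more human-like variation; values near
    zero suggest uniform, machine-like sentence structure.
    (Illustrative proxy metric, not TruthScan's exact formula.)"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2 or mean(lengths) == 0:
        return 0.0
    return pstdev(lengths) / mean(lengths)

human = ("I ran. The storm came out of nowhere, flooding every street "
         "in minutes. We waited. Hours later, when the water finally "
         "receded, the damage was everywhere.")
robotic = ("The weather was bad today. The rain fell for many hours. "
           "The streets were wet everywhere. The people stayed inside.")

print(burstiness(human) > burstiness(robotic))  # True: varied vs. uniform
```

A real classifier like the RoBERTa model combines many such signals in learned feature space; this sketch only shows why sentence-length variance carries signal at all.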

2. Image — XceptionNet + ELA + Hive Moderation

Image forensics requires detecting two distinct phenomena: GAN-generated images (where every pixel is synthetic) and manipulated images (where real photographs have been edited). XceptionNet, trained on the FaceForensics++ dataset, excels at detecting deepfake faces. Error Level Analysis (ELA) detects localized edits — splicing, inpainting, object removal — by revealing regions that were compressed at different times. Hive Moderation provides a production-grade AI image classifier trained on outputs from Midjourney, DALL-E, Stable Diffusion, and Adobe Firefly.
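The core idea behind Error Level Analysis is simple enough to sketch with Pillow: re-save the image as JPEG at a fixed quality and look at the per-pixel difference. Regions edited after the original compression recompress differently and stand out. This is a minimal sketch of the general ELA technique, not TruthScan's production pipeline.

```python
from io import BytesIO

from PIL import Image, ImageChops  # Pillow

def error_level_analysis(img: Image.Image, quality: int = 90) -> Image.Image:
    """Re-save the image as JPEG at a fixed quality and return the
    per-pixel difference map. Areas compressed at a different time
    than the rest of the image appear brighter in the result."""
    buf = BytesIO()
    img.convert("RGB").save(buf, "JPEG", quality=quality)
    buf.seek(0)
    resaved = Image.open(buf)
    return ImageChops.difference(img.convert("RGB"), resaved)

# Toy demo: a flat, untouched image yields a near-zero difference map.
flat = Image.new("RGB", (64, 64), (128, 128, 128))
ela = error_level_analysis(flat)
print(ela.getextrema())  # per-band (min, max) values, all near zero
```

On a real spliced photograph, the pasted region typically shows visibly higher error levels than its surroundings, which is the signal the production classifier consumes.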

3. Audio — RawNet2 + Wav2Vec2 + MFCC-CNN

Voice clone detection is the most technically demanding modality because modern voice synthesis has reached human parity in perceptual quality. Our audio engine operates on three levels: RawNet2 analyzes raw waveform artifacts invisible to human hearing, Wav2Vec2 XLSR operates on learned feature representations, and MFCC-CNN analyzes hand-crafted spectral fingerprints. Voice cloning systems produce characteristic artifacts in all three domains simultaneously.
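To make the "spectral fingerprint" level concrete, here is a naive DFT that recovers the dominant frequency of an audio frame. Production systems use FFTs, mel filterbanks, and cepstral coefficients; this toy version only illustrates the kind of spectral view the MFCC-CNN operates on.

```python
import cmath
import math

def spectrum(frame):
    """Naive DFT magnitude spectrum of one audio frame.
    (Illustrative only: real pipelines use FFTs plus mel-scale
    filterbanks to build the MFCC features.)"""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n // 2)]

# A 1 kHz tone sampled at 8 kHz: energy concentrates in a single bin.
sr, n = 8000, 256
tone = [math.sin(2 * math.pi * 1000 * t / sr) for t in range(n)]
mags = spectrum(tone)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
print(round(peak_bin * sr / n))  # 1000 (Hz)
```

Voice-clone artifacts show up as anomalies in exactly this kind of representation: energy in bands where natural speech has none, or unnaturally smooth harmonic structure.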

4. Video — XceptionNet + SyncNet + RetinaFace

Video deepfake detection requires both spatial analysis (per-frame face manipulation) and temporal analysis (audio-video synchronization). SyncNet measures lip-sync correlation — deepfakes created by replacing a face while keeping original audio, or vice versa, produce characteristic desynchronization at the sub-frame level. RetinaFace provides precise face localization for per-region XceptionNet analysis.
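The desynchronization signal can be sketched as a lag search: correlate an audio-energy track against a mouth-opening track and find the offset that aligns them best. This is a toy proxy for SyncNet, which instead compares learned audio and visual embeddings; the signals and threshold here are made up for illustration.

```python
def best_lag(audio_energy, mouth_open, max_lag=5):
    """Return the lag (in frames) that maximizes Pearson correlation
    between audio energy and mouth-opening signals. Genuine footage
    peaks near lag 0; re-dubbed or face-swapped video drifts away.
    (Toy stand-in for SyncNet's embedding-distance measure.)"""
    def corr(a, b):
        n = min(len(a), len(b))
        ma, mb = sum(a[:n]) / n, sum(b[:n]) / n
        cov = sum((a[i] - ma) * (b[i] - mb) for i in range(n))
        va = sum((a[i] - ma) ** 2 for i in range(n)) ** 0.5
        vb = sum((b[i] - mb) ** 2 for i in range(n)) ** 0.5
        return cov / (va * vb) if va and vb else 0.0
    return max(range(-max_lag, max_lag + 1),
               key=lambda lag: corr(audio_energy[max(lag, 0):],
                                    mouth_open[max(-lag, 0):]))

# In-sync tracks peak at lag 0; shifting one by 3 frames moves the peak.
sig = [0, 1, 0, 3, 0, 2, 0, 0, 4, 0, 1, 0, 5, 0, 0]
print(best_lag(sig, sig))       # 0  (in sync)
print(best_lag(sig, sig[3:]))   # 3  (the injected shift)
```

SyncNet's advantage over this kind of hand-built correlation is that it detects sub-frame offsets and mismatches in phoneme shape, not just gross timing drift.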

Ensemble voting and confidence calibration

The final TruthScan verdict is not a simple majority vote. Each model outputs a probability score, and these scores are combined using a learned weighting matrix trained to minimize false positives while maintaining a high true positive rate. The weighting varies by content type: for a 30-second audio clip, RawNet2 receives a higher weight than it does for a 5-second clip, where there is too little signal for reliable spectral analysis.
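One common way to combine probability scores with learned weights is to average them in logit space and squash the result back to a probability. The weights below are made-up illustrative numbers, not TruthScan's trained matrix, but the mechanics match the weighted-combination idea described above.

```python
import math

def ensemble_score(scores, weights):
    """Weighted average of per-model probabilities in logit space,
    mapped back through a sigmoid. (Sketch with invented weights,
    not TruthScan's learned weighting matrix.)"""
    logit = lambda p: math.log(p / (1 - p))
    z = sum(w * logit(p) for p, w in zip(scores, weights)) / sum(weights)
    return 1 / (1 + math.exp(-z))

# Three audio models on a long clip: RawNet2 gets the largest weight.
scores  = {"rawnet2": 0.97, "wav2vec2": 0.88, "mfcc_cnn": 0.91}
weights = {"rawnet2": 0.5, "wav2vec2": 0.3, "mfcc_cnn": 0.2}
p = ensemble_score(scores.values(), [weights[k] for k in scores])
print(round(p, 3))
```

Combining in logit space rather than averaging raw probabilities keeps a confident specialist from being drowned out by lukewarm generalists, which is why it is a common choice for ensembles of calibrated models.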

Confidence calibration is performed using temperature scaling — a post-hoc technique that aligns model confidence scores with empirical accuracy on a held-out calibration set. This ensures that when TruthScan reports 95% confidence, the result is correct approximately 95% of the time.
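Temperature scaling itself fits in a few lines: find the single scalar T that, when dividing the model's logits, minimizes negative log-likelihood on the calibration set. The sketch below uses a 1-D grid search in place of the usual gradient step, and the logits and labels are toy data.

```python
import math

def fit_temperature(logits, labels, grid=None):
    """Grid-search the temperature T minimizing negative log-likelihood
    of calibration labels when logits are divided by T. (Sketch: real
    implementations typically optimize T by gradient descent.)"""
    grid = grid or [t / 10 for t in range(5, 51)]
    def nll(T):
        total = 0.0
        for z, y in zip(logits, labels):
            p = 1 / (1 + math.exp(-z / T))
            p = min(max(p, 1e-9), 1 - 1e-9)
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
        return total
    return min(grid, key=nll)

# An overconfident model: large logits but imperfect labels. The fitted
# T > 1 shrinks reported confidence toward the empirical accuracy.
logits = [4.0, 3.5, 5.0, 4.2, -4.0, 3.8, 4.5, -3.9]
labels = [1,   1,   1,   0,    0,   1,   1,    0]
T = fit_temperature(logits, labels)
print(T > 1.0)  # True
```

Because T only rescales logits, temperature scaling changes confidence values without ever flipping a verdict, which is exactly the property you want from a post-hoc calibration step.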

Benchmark results

TruthScan is evaluated monthly on a held-out benchmark dataset containing 50,000 samples per modality, balanced across generation tools and real content. Current accuracy figures: text 93%, image 95%, audio 96%, video 97%. Ensemble accuracy across all modalities: 99.8%.

Try AIGeneratedIt free

Detect AI-generated text, images, audio, and video. No account required.

Run AI Detector Scan →