Voice clone detection is among the hardest problems in synthetic media forensics. Unlike deepfake video, where spatial artifacts in face regions provide a strong detection signal, high-quality voice clones are engineered to leave few detectable artifacts. We benchmarked three leading audio detection models on 12,000 samples spanning multiple synthesis systems.
Methodology
The benchmark dataset contains 12,000 audio samples: 6,000 genuine human recordings from diverse speakers, languages, and recording conditions; and 6,000 synthetic samples generated by multiple voice synthesis systems. All synthetic samples were generated from genuine recordings in the dataset, allowing direct speaker-matched comparison. Samples range from 3 to 60 seconds. Each sample was evaluated by each model independently, with no ensemble combination, to measure individual model performance.
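The per-model protocol above can be sketched as a simple scoring loop. This is a minimal illustration, not the benchmark harness itself; the `Sample` type, field names, and 0.5 decision threshold are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    score: float        # a model's synthetic-probability for one clip
    is_synthetic: bool  # ground-truth label

def evaluate(samples, threshold=0.5):
    """Metrics for one model evaluated independently (no ensembling)."""
    tp = sum(s.score >= threshold and s.is_synthetic for s in samples)
    fp = sum(s.score >= threshold and not s.is_synthetic for s in samples)
    tn = sum(s.score < threshold and not s.is_synthetic for s in samples)
    fn = sum(s.score < threshold and s.is_synthetic for s in samples)
    accuracy = (tp + tn) / len(samples)
    tpr = tp / (tp + fn)  # share of synthetic clips correctly flagged
    fpr = fp / (fp + tn)  # share of genuine clips wrongly flagged
    return accuracy, tpr, fpr
```

The TPR/FPR pair, rather than accuracy alone, is what the per-model results below report, since the two error types carry very different costs in fraud screening.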
RawNet2
RawNet2 operates directly on raw waveform data, learning its own feature representations through convolutional layers rather than relying on hand-crafted features. This gives it strong performance on novel synthesis systems whose artifacts differ from the training distribution, because it learns low-level waveform properties rather than specific artifact patterns.
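The raw-waveform front end can be illustrated as learned 1-D filters slid directly over audio samples. This is a sketch of the idea only, not RawNet2's actual architecture; the random kernels stand in for learned weights, and the sizes are arbitrary.

```python
import numpy as np

def waveform_features(waveform, kernels, stride=4):
    """Apply 1-D filters directly to raw samples -- the style of front
    end RawNet2 uses in place of hand-crafted spectral features."""
    k = kernels.shape[1]
    # Frame the waveform into overlapping windows, one per output step
    windows = np.lib.stride_tricks.sliding_window_view(waveform, k)[::stride]
    return windows @ kernels.T  # (n_frames, n_kernels) feature map

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)  # 1 s of audio at a 16 kHz sample rate
feats = waveform_features(wave, rng.standard_normal((8, 128)))
```

Because the filters themselves are learned, nothing ties them to artifacts of any particular synthesis system, which is the intuition behind RawNet2's robustness on out-of-distribution generators.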
Benchmark results: 96.2% overall accuracy, 97.1% TPR, 4.8% FPR. Performance degraded most significantly on samples under 5 seconds (89.4% accuracy), which is expected given the reduced signal available for waveform analysis. Strongest performance was on voice clones generated by neural TTS systems (98.3% accuracy).
Wav2Vec2 XLSR
Wav2Vec2 XLSR is a self-supervised speech model pre-trained on 128 languages. For voice clone detection, we fine-tuned the XLSR checkpoint on the ASVspoof 2021 dataset with additional synthetic samples. Wav2Vec2 operates on learned feature representations of speech, making it particularly effective at detecting prosodic anomalies — the subtle patterns of stress and rhythm that voice cloning systems struggle to reproduce accurately.
Benchmark results: 93.8% overall accuracy, 94.2% TPR, 6.1% FPR. Wav2Vec2 outperformed RawNet2 on compressed audio (e.g., telephone-quality recordings sampled at 8 kHz), where the waveform artifacts RawNet2 relies on are partially destroyed by compression. This makes Wav2Vec2 particularly useful for call-center fraud detection.
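The telephone-quality condition can be approximated with a crude band limit: 8 kHz sampling caps content at a 4 kHz Nyquist frequency, discarding the high-frequency region where many waveform-level synthesis artifacts live. The FFT low-pass below is a simplified stand-in for real telephony codecs, shown only to illustrate the regime.

```python
import numpy as np

def telephone_band(waveform, sr=16000, cutoff=4000):
    """Zero out spectral content above 4 kHz, roughly simulating what
    survives an 8 kHz telephone channel. Artifacts above the cutoff are
    unrecoverable, which is why waveform-level detectors degrade here."""
    spectrum = np.fft.rfft(waveform)
    freqs = np.fft.rfftfreq(len(waveform), d=1 / sr)
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=len(waveform))
```

A detector that leans on prosodic structure, which sits well below 4 kHz, keeps most of its signal through this channel, consistent with Wav2Vec2's relative advantage on compressed audio.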
MFCC-CNN
The MFCC-CNN approach extracts Mel-frequency cepstral coefficients — a hand-crafted representation of the spectral envelope of audio — and feeds them into a convolutional classifier. MFCC features are the most interpretable of the three approaches: specific cepstral bins correspond to identifiable acoustic properties, making it possible to understand why a specific recording was classified as synthetic.
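The MFCC pipeline itself is compact enough to sketch end to end: frame the signal, take a power spectrum, pool it through triangular mel filters, take the log, and decorrelate with a DCT. All parameter values below (frame size, 26 mel bands, 13 coefficients) are common defaults assumed for illustration, not the benchmarked model's configuration.

```python
import numpy as np

def mfcc(waveform, sr=16000, n_fft=512, hop=256, n_mels=26, n_mfcc=13):
    """Minimal MFCC extraction: each output column is a cepstral
    coefficient tied to an interpretable property of the spectral
    envelope, which is what makes this front end explainable."""
    # 1. Frame the signal and take the per-frame power spectrum
    frames = np.lib.stride_tricks.sliding_window_view(waveform, n_fft)[::hop]
    frames = frames * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # 2. Triangular filterbank spaced evenly on the mel scale
    mel_pts = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_mels + 2)
    hz_pts = 700 * (10 ** (mel_pts / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)
    # 3. DCT-II decorrelates log-mel energies into cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T  # (n_frames, n_mfcc)
```

The resulting coefficient matrix is what the convolutional classifier consumes; because each coefficient has a fixed acoustic meaning, per-coefficient attribution is straightforward in a way it is not for learned features.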
Benchmark results: 91.1% overall accuracy, 92.4% TPR, 8.3% FPR. MFCC-CNN showed the largest performance gap between seen and unseen synthesis systems: 94.2% on systems in the training distribution, 87.9% on out-of-distribution systems. This is the expected weakness of hand-crafted feature approaches compared to end-to-end learned representations.
Ensemble performance
The TruthScan ensemble combines all three models using a learned voting matrix. Ensemble benchmark results: 96.4% overall accuracy, 97.8% TPR, 3.1% FPR. The ensemble consistently outperforms any individual model, with the largest gains on short clips (under 5 seconds) where individual models are weakest, and on out-of-distribution synthesis systems where model diversity provides robustness.
Conclusions
No single model dominates across all conditions. RawNet2 is the best single model for high-quality audio. Wav2Vec2 is preferred for compressed or telephone-quality audio. MFCC-CNN provides interpretability and complementary signal for ensemble combination. For production deployment, the ensemble is clearly the right choice — the false positive rate reduction from 4.8% (RawNet2 alone) to 3.1% (ensemble) is operationally significant at scale.