Between September 2024 and January 2025, our research team conducted a comprehensive analysis of deepfake detection performance across video, audio, and image modalities. Our methodology combined a systematic review of academic research with direct evaluation of 12 state-of-the-art detection models using the Deepfake-Eval-2024 dataset compiled through the TrueMedia.org platform and social media moderation forums.
This quarterly deepfake detection benchmark, together with the supporting meta-analysis of human performance, aggregates data from:
45 hours of video content
56.5 hours of audio recordings
1,975 images from 88 websites
Content spanning 52 languages
86,155 participants across 56 peer-reviewed studies (from the meta-analysis of human detection)
We evaluated both open-source and commercial detection models against real-world deepfakes circulating on social media platforms. The results reveal a critical gap between laboratory performance and actual deployment effectiveness.
How Accurate Is Deepfake Detection in 2025?
The following table presents overall detection accuracy rates across the three primary content types where deepfake manipulation occurs:
| Detection Method | Video Accuracy | Audio Accuracy | Image Accuracy | Overall Performance |
|---|---|---|---|---|
| Human Detection (Untrained) | 57% | 62% | 53% | 55% |
| Human Detection (High-Quality Fakes) | 25% | 48% | 41% | 35% |
| Open-Source Models (Lab Conditions) | 96% | 100% | 94% | 97% |
| Open-Source Models (Real-World) | 63% | 53% | 56% | 57% |
| Commercial Systems | 78% | 89% | 82% | 83% |
Human Detection Barely Beats Random Chance
Untrained human accuracy sits at just 55% overall. When confronted with high-quality deepfakes, performance collapses:
Video detection drops to 25% accuracy
Three out of four sophisticated fakes pass undetected
Audio fakes are identified correctly only 48% of the time
Image manipulation detection falls to 41%
This data comes from the most comprehensive meta-analysis of human deepfake detection ever conducted. The systematic review synthesized 56 papers involving 86,155 participants across multiple demographics and training levels.
The Automation Challenge
These findings create a critical dependency on automated detection systems. Yet they also highlight a fundamental challenge: even the best AI models struggle to significantly outperform humans in real-world conditions.
Why Lab Performance Doesn't Predict Real-World Results
Detection models experience dramatic accuracy drops when deployed against actual deepfakes compared to controlled laboratory testing.
| Model Category | Lab Benchmark AUC | Real-World AUC | Performance Drop | Generalization Score |
|---|---|---|---|---|
| Video Detection (GenConViT) | 96% | 63% | 50% | Poor |
| Audio Detection (AASIST) | 100% | 43% | 48% | Poor |
| Audio Detection (RawNet2) | 99% | 53% | 48% | Poor |
| Image Detection (UFD) | 94% | 56% | 45% | Poor |
| Commercial Video Systems | 89% | 79% | 11% | Moderate |
The 50% Performance Collapse
Video detection models suffer the most dramatic failure. GenConViT achieves a 96% AUC on academic datasets such as FaceForensics++, yet against contemporary in-the-wild content it falls to 63%, the roughly 50% degradation the benchmark reports for video models. AASIST drops from a 100% AUC on ASVspoof2019 to just 43% on real-world audio, and RawNet2 falls from 99% to 53%.
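Gaps like these are straightforward to measure once a model's confidence scores are available on both kinds of data. Below is a minimal sketch using scikit-learn, with hypothetical placeholder arrays standing in for real ground-truth labels and detector scores:

```python
# Minimal sketch: quantifying the lab-to-field AUC gap for one detector.
# Assumes ground-truth labels (1 = fake, 0 = real) and model confidence
# scores exist for both an academic test set and an in-the-wild set such
# as Deepfake-Eval-2024. All array contents below are placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score

lab_labels = np.array([1, 0, 1, 0, 1, 0])
lab_scores = np.array([0.97, 0.05, 0.91, 0.12, 0.88, 0.09])
wild_labels = np.array([1, 0, 1, 0, 1, 0])
wild_scores = np.array([0.62, 0.48, 0.55, 0.51, 0.40, 0.45])

lab_auc = roc_auc_score(lab_labels, lab_scores)
wild_auc = roc_auc_score(wild_labels, wild_scores)

# One common way to express the degradation: relative drop from the lab score.
relative_drop = (lab_auc - wild_auc) / lab_auc
print(f"Lab AUC: {lab_auc:.2f} | In-the-wild AUC: {wild_auc:.2f} "
      f"| Relative drop: {relative_drop:.0%}")
```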
Why Models Fail in the Real World
The Deepfake-Eval-2024 benchmark revealed the root cause across all major detection architectures:
Models learn to identify artifacts specific to training environments rather than developing robust generalization capabilities.
Academic datasets use outdated manipulation techniques, contain limited content diversity, and feature structured media that differs significantly from social media content.
Models trained on this data fail when confronted with the following (an augmentation sketch follows the list):
Modern voice cloning technology
Diverse linguistic contexts (52 languages)
Social media compression artifacts
Text overlays and background music
Diffusion model outputs like Stable Diffusion and DALL-E
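One commonly cited mitigation is to expose detectors to these conditions during training. As an illustration, here is a minimal sketch (assuming Pillow is installed) that simulates social-media compression artifacts, one of the failure conditions above, by aggressively re-encoding training images; the function name and quality range are illustrative:

```python
# Minimal sketch: simulating social-media compression artifacts by
# round-tripping training images through low-quality JPEG encoding.
# This addresses only one of the failure modes listed above.
import io
import random
from PIL import Image

def recompress(image: Image.Image, quality_range=(30, 70)) -> Image.Image:
    """Re-encode an image as lossy JPEG at a random quality level."""
    quality = random.randint(*quality_range)
    buffer = io.BytesIO()
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    return Image.open(buffer).copy()

# Usage: apply to each training image before it reaches the detector,
# so the model learns to separate compression noise from manipulation traces.
augmented = recompress(Image.open("training_frame.png"))  # hypothetical path
```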
Human Experts vs. AI Detection Systems: Who Wins?
Direct comparison between human analysts and automated detection systems reveals clear performance patterns across content types:
| Content Type | Human Experts | Open-Source Models | Commercial Models | Best Performer |
|---|---|---|---|---|
| Video Deepfakes | 57% | 63% | 78% | Commercial |
| Audio Deepfakes | 62% | 53% | 89% | Commercial |
| Image Deepfakes | 53% | 56% | 82% | Commercial |
| High-Quality Video | 25% | 48% | 71% | Commercial |
| Non-English Audio | 55% | 46% | 82% | Commercial |
Commercial Systems Lead Across All Categories
Commercial detection systems outperform both human analysts and open-source models across every content type:
89% accuracy for audio (compared to 62% human, 53% open-source)
82% accuracy for images (compared to 53% human, 56% open-source)
78% accuracy for video (compared to 57% human, 63% open-source)
The Training Data Advantage
Commercial systems don't have better algorithms. They have better data. Their superior performance appears primarily attributable to training on more representative datasets rather than architectural innovations, a conclusion supported by open-source models approaching commercial performance once fine-tuned on in-the-wild data.
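This suggests a practical recipe: take a pretrained open-source detector and fine-tune only its classification head on in-the-wild examples. A minimal PyTorch sketch, where `detector`, its `head` attribute, and `wild_loader` are hypothetical stand-ins for a real model and a labeled DataLoader:

```python
# Minimal sketch: fine-tuning a pretrained open-source detector on
# in-the-wild data. `detector`, `detector.head`, and `wild_loader` are
# hypothetical names for your model and a labeled DataLoader.
import torch
import torch.nn as nn

def finetune(detector: nn.Module, wild_loader, epochs: int = 3):
    # Freeze the backbone and adapt only the classification head,
    # since the gap is attributed to data rather than architecture.
    for param in detector.parameters():
        param.requires_grad = False
    for param in detector.head.parameters():
        param.requires_grad = True

    optimizer = torch.optim.AdamW(detector.head.parameters(), lr=1e-4)
    loss_fn = nn.BCEWithLogitsLoss()

    detector.train()
    for _ in range(epochs):
        for frames, labels in wild_loader:
            optimizer.zero_grad()
            logits = detector(frames).squeeze(-1)
            loss = loss_fn(logits, labels.float())
            loss.backward()
            optimizer.step()
    return detector
```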
Where Humans Still Matter
For high-quality video deepfakes, human detection collapses to 25% accuracy while commercial systems maintain 71% performance. Non-English audio presents particular challenges for open-source models, with accuracy dropping to 46% compared to commercial systems' 82%.
These findings underscore the critical importance of:
Training data diversity
Continuous model updating against contemporary generation techniques
Official platform partnerships for data access
Content Characteristics That Break Detection Systems
Specific content attributes significantly impact detection accuracy, revealing systematic weaknesses across all model architectures.
| Content Attribute | Baseline Accuracy | Accuracy with Attribute | Performance Impact | Affected Systems |
|---|---|---|---|---|
| Background Music (Audio) | 78% | 52% | -26% | All Audio Models |
| Text Overlays (Image) | 65% | 56% | -9% | Image Detectors |
| Non-English Language | 70% | 63% | -7% | Audio Detectors |
| Social Media Compression | 68% | 59% | -9% | All Modalities |
| Non-Facial Manipulation | 72% | 54% | -18% | Video Systems |
Background Music: The Audio Killer
Audio detectors lose 26% accuracy when deepfakes contain background music. This represents the single largest vulnerability identified in the benchmark. Background music is common in generation workflows but rare in training datasets. Models simply haven't learned to separate manipulation artifacts from musical overlay effects.
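The obvious countermeasure is to inject musical overlays into training audio so models encounter them before deployment. A minimal NumPy sketch of mixing a music track into a speech clip at a target signal-to-noise ratio; the array names are hypothetical, and both signals are assumed to share a sample rate:

```python
# Minimal sketch: mixing background music into a training clip at a
# target SNR, so audio detectors learn to ignore musical overlays.
# `speech` and `music` are hypothetical float arrays at the same rate.
import numpy as np

def mix_with_music(speech: np.ndarray, music: np.ndarray,
                   snr_db: float = 10.0) -> np.ndarray:
    # Loop or trim the music to match the speech length.
    reps = int(np.ceil(len(speech) / len(music)))
    music = np.tile(music, reps)[: len(speech)]

    # Scale the music so the speech-to-music power ratio equals snr_db.
    speech_power = np.mean(speech ** 2)
    music_power = np.mean(music ** 2) + 1e-12
    scale = np.sqrt(speech_power / (music_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * music

    # Prevent clipping after the overlay is added.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```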
Language Limitations
Non-English content reduces audio detection by 7%. This reflects a critical limitation: existing audio datasets typically include only English content, with at most two languages represented.
Real-world threats span 52 languages in contemporary benchmarks.
Other Critical Vulnerabilities
All modalities: Social media compression artifacts reduce accuracy by 9% across the board. Models struggle to distinguish compression effects from manipulation traces.
Image systems: Text overlays reduce accuracy by 9%. These overlays are prevalent in social media sharing but absent from training data.
Video systems: Non-facial body modifications cause 18% accuracy reductions. Most video detection focuses on facial manipulation, missing full-body deepfakes.
Best Commercial Deepfake Detection Performance in 2025
Top-performing commercial models represent the current state-of-the-art for deployed detection technology, significantly outperforming open-source alternatives.
| Performance Metric | Audio Systems | Video Systems | Image Systems | Cross-Modal Average |
|---|---|---|---|---|
| Accuracy | 89% | 78% | 82% | 83% |
| AUC Score | 93% | 79% | 90% | 87% |
| Precision | 89% | 77% | 84% | 83% |
| Recall | 84% | 77% | 71% | 77% |
| False Positive Rate | 8% | 12% | 11% | 10% |
Audio Detection Leads Performance
Commercial audio detection achieves 89% accuracy with 93% AUC. This represents the highest performance across all modalities and approaches the estimated 90% accuracy threshold of human forensic analysts.
The balanced metrics show robust performance:
89% precision (few false alarms)
84% recall (catches most fakes)
Only 8% false positive rate
Video and Image Systems Trail Behind
Video detection reaches 78% accuracy with balanced 77% precision and recall. Image detection achieves 82% accuracy despite a lower 71% recall, meaning it misses 29% of manipulated images.
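All of these metrics derive from the same four confusion-matrix counts. A minimal worked sketch, with hypothetical counts chosen to reproduce the commercial audio column above:

```python
# Minimal sketch: deriving the table's metrics from confusion-matrix
# counts. The counts are hypothetical, chosen so the results match the
# commercial audio column (89% accuracy/precision, 84% recall, 8% FPR).
tp, fn = 84, 16    # of 100 fake clips: caught vs. missed
fp, tn = 10, 120   # of 130 real clips: falsely flagged vs. passed

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of flagged items, how many were actually fake
recall    = tp / (tp + fn)   # of actual fakes, how many were caught
fpr       = fp / (fp + tn)   # real content wrongly flagged
fnr       = fn / (fn + tp)   # fakes that slipped through (1 - recall)

print(f"accuracy={accuracy:.0%} precision={precision:.0%} "
      f"recall={recall:.0%} FPR={fpr:.0%} FNR={fnr:.0%}")
```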
The Reliability Gap
Even top-performing commercial systems maintain 8-12% false positive rates and 16-29% false negative rates. This indicates continued development needs for mission-critical applications requiring higher reliability thresholds, such as:
Legal evidence evaluation
Financial fraud prevention
Election security monitoring
Celebrity impersonation protection
What This Means for Organizations
Commercial systems provide the best available protection but still fall short of the reliability needed for high-stakes decisions. Organizations should implement layered defense strategies, sketched in code after this list, combining:
Commercial-grade detection systems
Human verification for critical decisions
Continuous monitoring of detection performance
Regular retraining on emerging deepfake techniques
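In code, a layered strategy reduces to a per-item routing decision: trust the automated verdict at the extremes of the detector's confidence range, and escalate the ambiguous middle band to human review. A minimal sketch with hypothetical thresholds:

```python
# Minimal sketch: routing content through a layered defense. The score
# is assumed to come from a commercial detector (0 = real, 1 = fake);
# the thresholds are hypothetical and should be tuned per deployment.
from enum import Enum

class Verdict(Enum):
    PASS = "pass"            # confidently real
    HUMAN_REVIEW = "review"  # ambiguous: escalate to an analyst
    BLOCK = "block"          # confidently fake

def route(score: float, low: float = 0.15, high: float = 0.90) -> Verdict:
    """Trust the detector at the extremes; escalate the middle band."""
    if score <= low:
        return Verdict.PASS
    if score >= high:
        return Verdict.BLOCK
    return Verdict.HUMAN_REVIEW

# Usage: high-stakes contexts (legal evidence, fraud, elections) can
# widen the review band by raising `low` and lowering `high`.
print(route(0.55))  # Verdict.HUMAN_REVIEW
```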
Learn More
You can learn more about Ceartas here, and contact us through our integrated chat service if you have any questions.
Sources
Ceartas. "Research Study on Deepfake Detection Benchmarks."
Diel, A., et al. "Human Performance in Detecting Deepfakes: A Systematic Review and Meta-Analysis." Computers in Human Behavior Reports, 2024.
Chandra, N. A., et al. "Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024." TrueMedia.org, United States.
"Detection of Deepfakes: Performance Analysis Across Multiple Modalities." Multiple research institutions, ScienceDirect, 2024.
KeepNet Labs. "Deepfake Statistics & Trends 2025." United Kingdom, 2024-2025.
SQ Magazine. "Deepfake Statistics 2025: The Hidden Cyber Threat." United Kingdom, 2024-2025.

