Between September 2024 and January 2025, our research team conducted a comprehensive analysis of deepfake detection performance across video, audio, and image modalities. Our methodology combined a systematic review of academic research with direct evaluation of 12 state-of-the-art detection models using the Deepfake-Eval-2024 dataset compiled through the TrueMedia.org platform and social media moderation forums.

This quarterly deepfake detection benchmark aggregates data from:

  • 45 hours of video content

  • 56.5 hours of audio recordings

  • 1,975 images from 88 websites

  • Content spanning 52 languages

  • 86,155 participants across 56 peer-reviewed studies

We evaluated both open-source and commercial detection models against real-world deepfakes circulating on social media platforms. The results reveal a critical gap between laboratory performance and actual deployment effectiveness.
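
To ground the numbers that follow, here is a minimal sketch of how a detector is scored against a labeled benchmark split. The labels, scores, and 0.5 decision threshold are illustrative rather than values from the study, and scikit-learn is assumed to be available.

    # Minimal sketch: scoring a detector on a labeled benchmark split.
    # Labels use 1 = fake, 0 = real; scores are model outputs in [0, 1].
    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # ground truth
    scores = np.array([0.91, 0.12, 0.55, 0.78, 0.40, 0.08, 0.33, 0.25])

    preds = (scores >= 0.5).astype(int)                              # fixed threshold
    print(f"accuracy: {accuracy_score(labels, preds):.2%}")
    print(f"AUC:      {roc_auc_score(labels, scores):.2%}")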

How Accurate Is Deepfake Detection in 2025?

The following table presents overall detection accuracy rates across the three primary content types where deepfake manipulation occurs:

Detection Method | Video Accuracy | Audio Accuracy | Image Accuracy | Overall Performance
Human Detection (Untrained) | 57% | 62% | 53% | 55%
Human Detection (High-Quality Fakes) | 25% | 48% | 41% | 35%
Open-Source Models (Lab Conditions) | 96% | 100% | 94% | 97%
Open-Source Models (Real-World) | 63% | 53% | 56% | 57%
Commercial Systems | 78% | 89% | 82% | 83%

Human Detection Barely Beats Random Chance

Untrained human observers correctly identify deepfakes just 55% of the time overall. When confronted with high-quality deepfakes, performance collapses:

  • Video detection drops to 25% accuracy

  • Three out of four sophisticated fakes pass undetected

  • Audio fakes are identified correctly only 48% of the time

  • Image manipulation detection falls to 41%

This data comes from the largest meta-analysis of human deepfake detection to date: a systematic review synthesizing 56 papers and 86,155 participants across multiple demographics and training levels.

The Automation Challenge

These findings create a critical dependency on automated detection systems. Yet they also highlight a fundamental challenge: even the best AI models struggle to significantly outperform humans in real-world conditions.

Why Lab Performance Doesn't Predict Real-World Results

Detection models experience dramatic accuracy drops when deployed against actual deepfakes compared to controlled laboratory testing.

Model Category | Lab Benchmark AUC | Real-World AUC | Performance Drop | Generalization Score
Video Detection (GenConViT) | 96% | 63% | 50% | Poor
Audio Detection (AASIST) | 100% | 43% | 48% | Poor
Audio Detection (RawNet2) | 99% | 53% | 48% | Poor
Image Detection (UFD) | 94% | 56% | 45% | Poor
Commercial Video Systems | 89% | 79% | 11% | Moderate

The 50% Performance Collapse

Video detection models experience the most dramatic failure. GenConViT achieves a 96% AUC on academic datasets such as FaceForensics++, yet plummets to 63% against contemporary in-the-wild content, which the benchmark reports as a 50% drop. AASIST falls from a 100% AUC on ASVspoof2019 to just 43% on real-world audio, and RawNet2 drops from 99% to 53%.
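
The gap can be reproduced in miniature with a cross-dataset evaluation loop: fit on "lab" data, then score a distribution-shifted "wild" set. This sketch uses synthetic Gaussian features as stand-ins for real detector inputs, so the exact numbers are illustrative only.

    # Hedged sketch of a cross-dataset generalization check.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    def make_split(shift):
        # reals and fakes separated in feature space; `shift` drags the
        # in-the-wild distribution away from the training manifold
        reals = rng.normal(0.0 + shift, 1.0, (500, 8))
        fakes = rng.normal(1.5 - shift, 1.0, (500, 8))
        return np.vstack([reals, fakes]), np.array([0] * 500 + [1] * 500)

    X_train, y_train = make_split(shift=0.0)   # lab training split
    X_lab, y_lab = make_split(shift=0.0)       # held-out lab split
    X_wild, y_wild = make_split(shift=0.6)     # in-the-wild split

    clf = LogisticRegression().fit(X_train, y_train)
    auc_lab = roc_auc_score(y_lab, clf.predict_proba(X_lab)[:, 1])
    auc_wild = roc_auc_score(y_wild, clf.predict_proba(X_wild)[:, 1])
    print(f"lab AUC: {auc_lab:.2f}  wild AUC: {auc_wild:.2f}  "
          f"relative drop: {(auc_lab - auc_wild) / auc_lab:.0%}")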

Why Models Fail in the Real World

The Deepfake-Eval-2024 benchmark revealed the root cause across all major detection architectures:

Models learn to identify artifacts specific to training environments rather than developing robust generalization capabilities.

Academic datasets use outdated manipulation techniques, contain limited content diversity, and feature structured media that differs significantly from social media content.

Models trained on this data fail when confronted with:

  • Modern voice cloning technology

  • Diverse linguistic contexts (52 languages)

  • Social media compression artifacts

  • Text overlays and background music

  • Diffusion model outputs like Stable Diffusion and DALL-E
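
Because compression artifacts and text overlays are attributes from the list above that training sets rarely contain, one plausible mitigation is to inject them during training. Below is a hedged sketch, assuming Pillow is installed; the caption text, banner layout, and JPEG quality setting are arbitrary choices for illustration.

    # Illustrative augmentation pass: stamp a caption-style overlay and
    # re-encode at aggressive JPEG quality to mimic social media sharing.
    import io
    from PIL import Image, ImageDraw

    def social_media_augment(img, quality=35):
        draw = ImageDraw.Draw(img)                    # mutates input in place
        draw.rectangle([(0, img.height - 40), (img.width, img.height)],
                       fill="black")
        draw.text((10, img.height - 32), "SHOCKING footage!!", fill="white")
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)  # compression artifacts
        buf.seek(0)
        return Image.open(buf)

    sample = Image.new("RGB", (640, 360), color=(90, 120, 200))  # placeholder frame
    augmented = social_media_augment(sample)
    augmented.save("augmented_sample.jpg")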

Human Experts vs. AI Detection Systems: Who Wins?

Direct comparison between human analysts and automated detection systems reveals clear performance patterns across content types:

Content Type | Human Experts | Open-Source Models | Commercial Models | Best Performer
Video Deepfakes | 57% | 63% | 78% | Commercial
Audio Deepfakes | 62% | 53% | 89% | Commercial
Image Deepfakes | 53% | 56% | 82% | Commercial
High-Quality Video | 25% | 48% | 71% | Commercial
Non-English Audio | 55% | 46% | 82% | Commercial

Commercial Systems Lead Across All Categories

Commercial detection systems outperform both human analysts and open-source models across every content type:

  • 89% accuracy for audio (compared to 62% human, 53% open-source)

  • 82% accuracy for images (compared to 53% human, 56% open-source)

  • 78% accuracy for video (compared to 57% human, 63% open-source)

The Training Data Advantage

Commercial systems don't have better algorithms; they have better data. Their superior performance appears primarily attributable to training on more representative datasets rather than architectural innovations, a conclusion supported by open-source models approaching commercial performance once fine-tuned on in-the-wild data.
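
As an illustration of that fine-tuning path, the sketch below freezes a stand-in backbone and retrains only a small classification head. The layers, tensors, and labels are placeholders rather than the benchmark's actual models or data, and PyTorch is assumed to be available.

    import torch
    import torch.nn as nn

    # stand-in backbone: in practice, a pretrained detector trunk
    backbone = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )
    for p in backbone.parameters():
        p.requires_grad = False          # keep lab-learned features frozen

    head = nn.Linear(16, 1)              # only this layer sees new data
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    frames = torch.randn(32, 3, 64, 64)            # placeholder in-the-wild batch
    labels = torch.randint(0, 2, (32, 1)).float()  # placeholder labels

    for _ in range(5):                   # a few illustrative steps
        opt.zero_grad()
        loss = loss_fn(head(backbone(frames)), labels)
        loss.backward()
        opt.step()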

Where the Gaps Are Widest

For high-quality video deepfakes, human detection collapses to 25% accuracy while commercial systems maintain 71% performance. Non-English audio presents particular challenges for open-source models, with accuracy dropping to 46% compared to commercial systems' 82%.

These findings underscore the critical importance of:

  • Training data diversity

  • Continuous model updating against contemporary generation techniques

  • Official platform partnerships for data access

Content Characteristics That Break Detection Systems

Specific content attributes significantly impact detection accuracy, revealing systematic weaknesses across all model architectures.

Content Attribute | Baseline Accuracy | Accuracy with Attribute | Performance Impact | Affected Systems
Background Music (Audio) | 78% | 52% | -26% | All Audio Models
Text Overlays (Image) | 65% | 56% | -9% | Image Detectors
Non-English Language | 70% | 63% | -7% | Audio Detectors
Social Media Compression | 68% | 59% | -9% | All Modalities
Non-Facial Manipulation | 72% | 54% | -18% | Video Systems

Background Music: The Audio Killer

Audio detectors lose 26 percentage points of accuracy when deepfakes contain background music. This is the single largest vulnerability identified in the benchmark: background music is common in generation workflows but rare in training datasets, so models simply haven't learned to separate manipulation artifacts from musical overlay effects.
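
One plausible countermeasure is augmenting training audio with music mixed in at controlled signal-to-noise ratios, so detectors see overlay effects before deployment. A minimal sketch, assuming 16 kHz mono float arrays; the noise "speech" and sine-tone "music" below are synthetic stand-ins.

    import numpy as np

    def mix_at_snr(speech, music, snr_db):
        # scale the music so that speech power / music power hits snr_db
        p_speech = np.mean(speech ** 2)
        p_music = np.mean(music ** 2)
        scale = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
        return speech + scale * music

    sr = 16_000
    t = np.arange(sr) / sr
    speech = 0.1 * np.random.randn(sr)            # noise stand-in for speech
    music = 0.5 * np.sin(2 * np.pi * 440 * t)     # 440 Hz tone as "music"
    mixed = mix_at_snr(speech, music, snr_db=10)  # music 10 dB below speech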

Language Limitations

Non-English content reduces audio detection accuracy by 7 percentage points. This reflects a critical limitation: existing audio training datasets typically include only English content, with at most two languages represented.

Real-world threats span the 52 languages represented in contemporary benchmarks.

Social Media Reality Check

Social media compression artifacts reduce accuracy by 9 percentage points across all modalities, because models struggle to distinguish compression effects from manipulation traces.

Other Critical Vulnerabilities

Image systems: Text overlays reduce accuracy by 9 percentage points. These overlays are prevalent in social media sharing but absent from training data.

Video systems: Non-facial body modifications cause an 18-percentage-point drop in accuracy. Most video detection focuses on facial manipulation, missing full-body deepfakes.

Best Commercial Deepfake Detection Performance in 2025

Top-performing commercial models represent the current state-of-the-art for deployed detection technology, significantly outperforming open-source alternatives.

Performance Metric | Audio Systems | Video Systems | Image Systems | Cross-Modal Average
Accuracy | 89% | 78% | 82% | 83%
AUC Score | 93% | 79% | 90% | 87%
Precision | 89% | 77% | 84% | 83%
Recall | 84% | 77% | 71% | 77%
False Positive Rate | 8% | 12% | 11% | 10%

Audio Detection Leads Performance

Commercial audio detection achieves 89% accuracy with 93% AUC. This represents the highest performance across all modalities and approaches the estimated 90% accuracy threshold of human forensic analysts.

The balanced metrics show robust performance:

  • 89% precision (few false alarms)

  • 84% recall (catches most fakes)

  • Only 8% false positive rate
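
For readers reproducing these figures, all three headline metrics fall out of a single confusion matrix. The label and prediction arrays below are illustrative, and scikit-learn is assumed.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])   # 1 = fake, 0 = real
    y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"precision: {tp / (tp + fp):.2%}")   # few false alarms when high
    print(f"recall:    {tp / (tp + fn):.2%}")   # catches most fakes when high
    print(f"FPR:       {fp / (fp + tn):.2%}")   # real content wrongly flagged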

Video and Image Systems Trail Behind

Video detection reaches 78% accuracy with balanced 77% precision and recall. Image detection achieves 82% accuracy despite a lower 71% recall, meaning it misses 29% of manipulated images.

The Reliability Gap

Even top-performing commercial systems maintain 8-12% false positive rates and 16-29% false negative rates. This indicates continued development needs for mission-critical applications requiring higher reliability thresholds, such as:

  • Legal evidence evaluation

  • Financial fraud prevention

  • Election security monitoring

  • Celebrity impersonation protection
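
For applications like these, one standard lever is to raise the decision threshold until the false positive rate fits the use case, accepting lower recall in exchange. A sketch on synthetic scores; a real deployment would calibrate on a held-out validation set.

    import numpy as np

    rng = np.random.default_rng(1)
    labels = rng.integers(0, 2, 2_000)                             # 1 = fake
    scores = np.clip(labels * 0.4 + rng.normal(0.3, 0.2, 2_000), 0, 1)

    target_fpr = 0.01                       # e.g. legal-evidence screening
    real_scores = scores[labels == 0]
    threshold = np.quantile(real_scores, 1 - target_fpr)  # caps FPR at ~1%

    flagged = scores >= threshold
    recall = flagged[labels == 1].mean()
    print(f"threshold={threshold:.2f}  recall at {target_fpr:.0%} FPR: {recall:.2%}")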

What This Means for Organizations

Commercial systems provide the best available protection but still fall short of the reliability needed for high-stakes decisions. Organizations should implement layered defense strategies combining:

  • Commercial-grade detection systems

  • Human verification for critical decisions

  • Continuous monitoring of detection performance

  • Regular retraining on emerging deepfake techniques
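
A minimal sketch of how the first two layers of this strategy might compose: route each item by detector confidence and escalate only the uncertain middle band to human reviewers. The band edges here are illustrative, not recommended values.

    def route(score):
        if score >= 0.90:
            return "auto-flag"        # high-confidence fake: block or label
        if score <= 0.10:
            return "auto-pass"        # high-confidence real: publish
        return "human-review"         # uncertain: escalate to an analyst

    for s in (0.95, 0.50, 0.03):
        print(f"score={s:.2f} -> {route(s)}")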

Learn More

You can learn more about Ceartas on our website and contact us through our integrated chat service if you have any questions.

Sources

  1. "Ceartas Research Study on Deepfake Detection Benchmarks." Ceartas.

  2. Diel, A., et al. "Human Performance in Detecting Deepfakes: A Systematic Review and Meta-Analysis." Computers in Human Behavior Reports, 2024.

  3. "Detection of Deepfakes: Performance Analysis Across Multiple Modalities." Multiple research institutions, ScienceDirect, 2024.

  4. "Deepfake Statistics & Trends 2025." KeepNet Labs, United Kingdom, 2024-2025.

  5. "Deepfake Statistics 2025: The Hidden Cyber Threat." SQ Magazine, United Kingdom, 2024-2025.

