Between September 2024 and January 2025, our research team conducted a comprehensive analysis of deepfake detection performance across video, audio, and image modalities. Our methodology combined a systematic review of academic research with direct evaluation of 12 state-of-the-art detection models using the Deepfake-Eval-2024 dataset compiled through the TrueMedia.org platform and social media moderation forums.

This quarterly deepfake detection benchmark aggregates data from:

  • 45 hours of video content

  • 56.5 hours of audio recordings

  • 1,975 images from 88 websites

  • Content spanning 52 languages

  • 86,155 participants across 56 peer-reviewed studies

We evaluated both open-source and commercial detection models against real-world deepfakes circulating on social media platforms. The results reveal a critical gap between laboratory performance and actual deployment effectiveness.
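
To ground the numbers that follow, here is a minimal sketch of how a detector is scored against a labeled benchmark split. The labels, scores, and 0.5 decision threshold are illustrative rather than values from the study, and scikit-learn is assumed to be available.

    # Minimal sketch: scoring a detector on a labeled benchmark split.
    # Labels use 1 = fake, 0 = real; scores are model outputs in [0, 1].
    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # ground truth
    scores = np.array([0.91, 0.12, 0.55, 0.78, 0.40, 0.08, 0.33, 0.25])

    preds = (scores >= 0.5).astype(int)                              # fixed threshold
    print(f"accuracy: {accuracy_score(labels, preds):.2%}")
    print(f"AUC:      {roc_auc_score(labels, scores):.2%}")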

How Accurate Is Deepfake Detection in 2025?

The following table presents overall detection accuracy rates across the three primary content types where deepfake manipulation occurs:

Detection Method | Video Accuracy | Audio Accuracy | Image Accuracy | Overall Performance
Human Detection (Untrained) | 57% | 62% | 53% | 55%
Human Detection (High-Quality Fakes) | 25% | 48% | 41% | 35%
Open-Source Models (Lab Conditions) | 96% | 100% | 94% | 97%
Open-Source Models (Real-World) | 63% | 53% | 56% | 57%
Commercial Systems | 78% | 89% | 82% | 83%

Human Detection Barely Beats Random Chance

Untrained human observers correctly identify deepfakes just 55% of the time overall. When confronted with high-quality deepfakes, performance collapses:

  • Video detection drops to 25% accuracy

  • Three out of four sophisticated fakes pass undetected

  • Audio fakes are identified correctly only 48% of the time

  • Image manipulation detection falls to 41%

This data comes from the largest meta-analysis of human deepfake detection to date: a systematic review synthesizing 56 papers and 86,155 participants across multiple demographics and training levels.

The Automation Challenge

These findings create a critical dependency on automated detection systems. Yet they also highlight a fundamental challenge: even the best AI models struggle to significantly outperform humans in real-world conditions.

Why Lab Performance Doesn't Predict Real-World Results

Detection models experience dramatic accuracy drops when deployed against actual deepfakes compared to controlled laboratory testing.

Model Category | Lab Benchmark AUC | Real-World AUC | Performance Drop | Generalization Score
Video Detection (GenConViT) | 96% | 63% | 50% | Poor
Audio Detection (AASIST) | 100% | 43% | 48% | Poor
Audio Detection (RawNet2) | 99% | 53% | 48% | Poor
Image Detection (UFD) | 94% | 56% | 45% | Poor
Commercial Video Systems | 89% | 79% | 11% | Moderate

The 50% Performance Collapse

Video detection models experience the most dramatic failure. GenConViT achieves a 96% AUC on academic datasets such as FaceForensics++, yet plummets to 63% against contemporary in-the-wild content, which the benchmark reports as a 50% drop. AASIST falls from a 100% AUC on ASVspoof2019 to just 43% on real-world audio, and RawNet2 drops from 99% to 53%.
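
The gap can be reproduced in miniature with a cross-dataset evaluation loop: fit on "lab" data, then score a distribution-shifted "wild" set. This sketch uses synthetic Gaussian features as stand-ins for real detector inputs, so the exact numbers are illustrative only.

    # Hedged sketch of a cross-dataset generalization check.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    def make_split(shift):
        # reals and fakes separated in feature space; `shift` drags the
        # in-the-wild distribution away from the training manifold
        reals = rng.normal(0.0 + shift, 1.0, (500, 8))
        fakes = rng.normal(1.5 - shift, 1.0, (500, 8))
        return np.vstack([reals, fakes]), np.array([0] * 500 + [1] * 500)

    X_train, y_train = make_split(shift=0.0)   # lab training split
    X_lab, y_lab = make_split(shift=0.0)       # held-out lab split
    X_wild, y_wild = make_split(shift=0.6)     # in-the-wild split

    clf = LogisticRegression().fit(X_train, y_train)
    auc_lab = roc_auc_score(y_lab, clf.predict_proba(X_lab)[:, 1])
    auc_wild = roc_auc_score(y_wild, clf.predict_proba(X_wild)[:, 1])
    print(f"lab AUC: {auc_lab:.2f}  wild AUC: {auc_wild:.2f}  "
          f"relative drop: {(auc_lab - auc_wild) / auc_lab:.0%}")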

Why Models Fail in the Real World

The Deepfake-Eval-2024 benchmark revealed the root cause across all major detection architectures:

Models learn to identify artifacts specific to training environments rather than developing robust generalization capabilities.

Academic datasets use outdated manipulation techniques, contain limited content diversity, and feature structured media that differs significantly from social media content.

Models trained on this data fail when confronted with:

  • Modern voice cloning technology

  • Diverse linguistic contexts (52 languages)

  • Social media compression artifacts

  • Text overlays and background music

  • Diffusion model outputs like Stable Diffusion and DALL-E
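
Because compression artifacts and text overlays are attributes from the list above that training sets rarely contain, one plausible mitigation is to inject them during training. Below is a hedged sketch, assuming Pillow is installed; the caption text, banner layout, and JPEG quality setting are arbitrary choices for illustration.

    # Illustrative augmentation pass: stamp a caption-style overlay and
    # re-encode at aggressive JPEG quality to mimic social media sharing.
    import io
    from PIL import Image, ImageDraw

    def social_media_augment(img, quality=35):
        draw = ImageDraw.Draw(img)                    # mutates input in place
        draw.rectangle([(0, img.height - 40), (img.width, img.height)],
                       fill="black")
        draw.text((10, img.height - 32), "SHOCKING footage!!", fill="white")
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)  # compression artifacts
        buf.seek(0)
        return Image.open(buf)

    sample = Image.new("RGB", (640, 360), color=(90, 120, 200))  # placeholder frame
    augmented = social_media_augment(sample)
    augmented.save("augmented_sample.jpg")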

Human Experts vs. AI Detection Systems: Who Wins?

Direct comparison between human analysts and automated detection systems reveals clear performance patterns across content types:

Content Type | Human Experts | Open-Source Models | Commercial Models | Best Performer
Video Deepfakes | 57% | 63% | 78% | Commercial
Audio Deepfakes | 62% | 53% | 89% | Commercial
Image Deepfakes | 53% | 56% | 82% | Commercial
High-Quality Video | 25% | 48% | 71% | Commercial
Non-English Audio | 55% | 46% | 82% | Commercial

Commercial Systems Lead Across All Categories

Commercial detection systems outperform both human analysts and open-source models across every content type:

  • 89% accuracy for audio (compared to 62% human, 53% open-source)

  • 82% accuracy for images (compared to 53% human, 56% open-source)

  • 78% accuracy for video (compared to 57% human, 63% open-source)

The Training Data Advantage

Commercial systems don't have better algorithms; they have better data. Their superior performance appears primarily attributable to training on more representative datasets rather than architectural innovations, a conclusion supported by open-source models approaching commercial performance once fine-tuned on in-the-wild data.
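
As an illustration of that fine-tuning path, the sketch below freezes a stand-in backbone and retrains only a small classification head. The layers, tensors, and labels are placeholders rather than the benchmark's actual models or data, and PyTorch is assumed to be available.

    import torch
    import torch.nn as nn

    # stand-in backbone: in practice, a pretrained detector trunk
    backbone = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )
    for p in backbone.parameters():
        p.requires_grad = False          # keep lab-learned features frozen

    head = nn.Linear(16, 1)              # only this layer sees new data
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()

    frames = torch.randn(32, 3, 64, 64)            # placeholder in-the-wild batch
    labels = torch.randint(0, 2, (32, 1)).float()  # placeholder labels

    for _ in range(5):                   # a few illustrative steps
        opt.zero_grad()
        loss = loss_fn(head(backbone(frames)), labels)
        loss.backward()
        opt.step()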

Where the Gaps Are Widest

For high-quality video deepfakes, human detection collapses to 25% accuracy while commercial systems maintain 71% performance. Non-English audio presents particular challenges for open-source models, with accuracy dropping to 46% compared to commercial systems' 82%.

These findings underscore the critical importance of:

  • Training data diversity

  • Continuous model updating against contemporary generation techniques

  • Official platform partnerships for data access

Content Characteristics That Break Detection Systems

Specific content attributes significantly impact detection accuracy, revealing systematic weaknesses across all model architectures.

Content Attribute | Baseline Accuracy | Accuracy with Attribute | Performance Impact | Affected Systems
Background Music (Audio) | 78% | 52% | -26% | All Audio Models
Text Overlays (Image) | 65% | 56% | -9% | Image Detectors
Non-English Language | 70% | 63% | -7% | Audio Detectors
Social Media Compression | 68% | 59% | -9% | All Modalities
Non-Facial Manipulation | 72% | 54% | -18% | Video Systems

Background Music: The Audio Killer

Audio detectors lose 26 percentage points of accuracy when deepfakes contain background music. This is the single largest vulnerability identified in the benchmark: background music is common in generation workflows but rare in training datasets, so models simply haven't learned to separate manipulation artifacts from musical overlay effects.
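
One plausible countermeasure is augmenting training audio with music mixed in at controlled signal-to-noise ratios, so detectors see overlay effects before deployment. A minimal sketch, assuming 16 kHz mono float arrays; the noise "speech" and sine-tone "music" below are synthetic stand-ins.

    import numpy as np

    def mix_at_snr(speech, music, snr_db):
        # scale the music so that speech power / music power hits snr_db
        p_speech = np.mean(speech ** 2)
        p_music = np.mean(music ** 2)
        scale = np.sqrt(p_speech / (p_music * 10 ** (snr_db / 10)))
        return speech + scale * music

    sr = 16_000
    t = np.arange(sr) / sr
    speech = 0.1 * np.random.randn(sr)            # noise stand-in for speech
    music = 0.5 * np.sin(2 * np.pi * 440 * t)     # 440 Hz tone as "music"
    mixed = mix_at_snr(speech, music, snr_db=10)  # music 10 dB below speech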

Language Limitations

Non-English content reduces audio detection accuracy by 7 percentage points. This reflects a critical limitation: existing audio training datasets typically include only English content, with at most two languages represented.

Real-world threats span the 52 languages represented in contemporary benchmarks.

Social Media Reality Check

Social media compression artifacts reduce accuracy by 9 percentage points across all modalities, because models struggle to distinguish compression effects from manipulation traces.

Other Critical Vulnerabilities

Image systems: Text overlays reduce accuracy by 9 percentage points. These overlays are prevalent in social media sharing but absent from training data.

Video systems: Non-facial body modifications cause an 18-percentage-point drop in accuracy. Most video detection focuses on facial manipulation, missing full-body deepfakes.

Best Commercial Deepfake Detection Performance in 2025

Top-performing commercial models represent the current state-of-the-art for deployed detection technology, significantly outperforming open-source alternatives.

Performance Metric | Audio Systems | Video Systems | Image Systems | Cross-Modal Average
Accuracy | 89% | 78% | 82% | 83%
AUC Score | 93% | 79% | 90% | 87%
Precision | 89% | 77% | 84% | 83%
Recall | 84% | 77% | 71% | 77%
False Positive Rate | 8% | 12% | 11% | 10%

Audio Detection Leads Performance

Commercial audio detection achieves 89% accuracy with 93% AUC. This represents the highest performance across all modalities and approaches the estimated 90% accuracy threshold of human forensic analysts.

The balanced metrics show robust performance:

  • 89% precision (few false alarms)

  • 84% recall (catches most fakes)

  • Only 8% false positive rate
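
For readers reproducing these figures, all three headline metrics fall out of a single confusion matrix. The label and prediction arrays below are illustrative, and scikit-learn is assumed.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])   # 1 = fake, 0 = real
    y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 0])

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"precision: {tp / (tp + fp):.2%}")   # few false alarms when high
    print(f"recall:    {tp / (tp + fn):.2%}")   # catches most fakes when high
    print(f"FPR:       {fp / (fp + tn):.2%}")   # real content wrongly flagged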

Video and Image Systems Trail Behind

Video detection reaches 78% accuracy with balanced 77% precision and recall. Image detection achieves 82% accuracy despite a lower 71% recall, meaning it misses 29% of manipulated images.

The Reliability Gap

Even top-performing commercial systems maintain 8-12% false positive rates and 16-29% false negative rates. This indicates continued development needs for mission-critical applications requiring higher reliability thresholds, such as:

  • Legal evidence evaluation

  • Financial fraud prevention

  • Election security monitoring

  • Celebrity impersonation protection
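
For applications like these, one standard lever is to raise the decision threshold until the false positive rate fits the use case, accepting lower recall in exchange. A sketch on synthetic scores; a real deployment would calibrate on a held-out validation set.

    import numpy as np

    rng = np.random.default_rng(1)
    labels = rng.integers(0, 2, 2_000)                             # 1 = fake
    scores = np.clip(labels * 0.4 + rng.normal(0.3, 0.2, 2_000), 0, 1)

    target_fpr = 0.01                       # e.g. legal-evidence screening
    real_scores = scores[labels == 0]
    threshold = np.quantile(real_scores, 1 - target_fpr)  # caps FPR at ~1%

    flagged = scores >= threshold
    recall = flagged[labels == 1].mean()
    print(f"threshold={threshold:.2f}  recall at {target_fpr:.0%} FPR: {recall:.2%}")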

What This Means for Organizations

Commercial systems provide the best available protection but still fall short of the reliability needed for high-stakes decisions. Organizations should implement layered defense strategies combining:

  • Commercial-grade detection systems

  • Human verification for critical decisions

  • Continuous monitoring of detection performance

  • Regular retraining on emerging deepfake techniques
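
A minimal sketch of how the first two layers of this strategy might compose: route each item by detector confidence and escalate only the uncertain middle band to human reviewers. The band edges here are illustrative, not recommended values.

    def route(score):
        if score >= 0.90:
            return "auto-flag"        # high-confidence fake: block or label
        if score <= 0.10:
            return "auto-pass"        # high-confidence real: publish
        return "human-review"         # uncertain: escalate to an analyst

    for s in (0.95, 0.50, 0.03):
        print(f"score={s:.2f} -> {route(s)}")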

Learn More

You can learn more about Ceartas on our website and contact us through our integrated chat service if you have any questions.

Sources

  1. "Ceartas Research Study on Deepfake Detection Benchmarks." Ceartas.

  2. Diel, A., et al. "Human Performance in Detecting Deepfakes: A Systematic Review and Meta-Analysis." Computers in Human Behavior Reports, 2024.

  3. "Detection of Deepfakes: Performance Analysis Across Multiple Modalities." Multiple research institutions, ScienceDirect, 2024.

  4. "Deepfake Statistics & Trends 2025." KeepNet Labs, United Kingdom, 2024-2025.

  5. "Deepfake Statistics 2025: The Hidden Cyber Threat." SQ Magazine, United Kingdom, 2024-2025.

