Anonymous submission | Under double-blind review

Evaluating AI-Manipulated Video Detectors under Real-World Distribution Shifts

sec 01

The reality gap

Reality Perception Gap (RPG) exposes the deployment gap in AI-video forensics: detectors tuned on clean clips break when videos are re-encoded by platforms, post-processed, captured with camera artifacts, locally edited, or generated by frontier models. RPG makes those shifts visible, so reviewers can see where synthetic-video detection fails outside the lab.

sec 02

What is new in Reality Perception Gap

i.

Socially Fragile Content

  • Crime
  • Natural Disaster
  • CCTV
  • Traffic
  • War

ii.

Coupled Distribution Shifts

  • Social Media Compression
  • Retiming
  • Center Crop
  • Camera Optical Artifacts

iii.

Hybrid Manipulation

  • Character Replacement
  • Video Extension
  • Relighting
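Two of the coupled shifts listed above, center crop and retiming, are simple enough to sketch directly. The following is a minimal illustration, assuming a video is a list of frames and each frame is a list of pixel rows. The function names, the nearest-frame resampling rule, and the `speed` parameter are hypothetical choices for illustration, not the pipeline used to build RPG.

```python
def center_crop(frame, out_h, out_w):
    """Return the central out_h x out_w window of a frame (list of rows)."""
    h, w = len(frame), len(frame[0])
    if out_h > h or out_w > w:
        raise ValueError("crop size exceeds frame size")
    top = (h - out_h) // 2
    left = (w - out_w) // 2
    return [row[left:left + out_w] for row in frame[top:top + out_h]]


def retime(frames, speed):
    """Resample frames by nearest index; speed > 1 drops frames, < 1 repeats them."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not frames:
        return []
    n_out = max(1, round(len(frames) / speed))
    last = len(frames) - 1
    return [frames[min(last, int(i * speed))] for i in range(n_out)]


# Toy 10-frame, 4x4 video whose pixel values encode (frame, row, column).
video = [[[t * 100 + r * 4 + c for c in range(4)] for r in range(4)]
         for t in range(10)]

# Couple the two shifts: speed up 2x, then keep the central 2x2 window.
shifted = [center_crop(f, 2, 2) for f in retime(video, 2.0)]
```

Applying the shifts in sequence, rather than one at a time, is what makes them coupled: a detector is scored on the clip after every transformation has been applied.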
sec 03

Video demonstrations

Group 01

Socially fragile categories

High-impact domains where manipulated video can cause social harm: combat, traffic, natural disasters, and crime.

  • Combat footage
  • Traffic incident
  • Natural disaster
  • Crime scene

Group 02

Localized AI edits in real footage

Mixed-source clips where the surrounding scene and camera statistics remain real while a region, attribute, or temporal continuation is synthesized.

  • Character replacement
  • Video extension
  • Relit footage

Group 03

Camera optical artifacts

Camera-side and camera-style artifact cases, including generated high-ISO noise, that reveal whether detectors confuse optical imperfections, sensor noise, and focus shifts with synthesis residuals.

  • Rolling shutter
  • Chromatic aberration
  • High-ISO noise
  • Autofocus hunting

sec 04

Detector performance on RPG

Detector    ACC     Macro F1   AUC ↑   TPR (fake)   TNR (real)   TPR at 1% FPR
D3          0.685   0.637      0.689   0.748        0.537        0.012
WaveRep     0.591   0.591      0.803   0.438        0.945        0.237
GenD        0.330   0.287      0.583   0.060        0.955        0.016
VideoFACT   0.562   0.479      0.583   0.562        0.561        0.025
FreqNet     0.404   0.404      0.548   0.285        0.679        0.025
AIGVDet     0.179   0.167      0.546   0.035        0.956        0.008
RINE        0.318   0.269      0.516   0.042        0.958        0.013
UFD         0.351   0.326      0.510   0.113        0.903        0.016
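The columns of this table can be computed from per-video detector scores. The sketch below shows one way to do so, assuming labels use 1 for fake and 0 for real, higher scores mean "more likely fake", and the fixed-threshold columns use a threshold of 0.5. These conventions and all function names are illustrative assumptions, not the evaluation code behind the reported numbers.

```python
def threshold_metrics(labels, scores, threshold=0.5):
    """ACC, macro F1, TPR (fake), and TNR (real) at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)

    def f1(hit, miss, false_alarm):
        denom = 2 * hit + miss + false_alarm
        return 2 * hit / denom if denom else 0.0

    return {
        "acc": (tp + tn) / len(labels),
        # Macro F1 averages the F1 of the fake class and the real class.
        "macro_f1": (f1(tp, fn, fp) + f1(tn, fp, fn)) / 2,
        "tpr_fake": tp / (tp + fn) if tp + fn else 0.0,
        "tnr_real": tn / (tn + fp) if tn + fp else 0.0,
    }


def auc(labels, scores):
    """Probability that a random fake outscores a random real (ties count half)."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if f > r else 0.5 if f == r else 0.0
               for f in fake for r in real)
    return wins / (len(fake) * len(real))


def tpr_at_fpr(labels, scores, max_fpr=0.01):
    """TPR at the loosest threshold whose false positive rate stays <= max_fpr."""
    fake = [s for y, s in zip(labels, scores) if y == 1]
    real = sorted((s for y, s in zip(labels, scores) if y == 0), reverse=True)
    allowed = min(int(max_fpr * len(real)), len(real) - 1)
    threshold = real[allowed]  # only scores strictly above this are called fake
    return sum(1 for s in fake if s > threshold) / len(fake)
```

The last function explains why a detector can pair a respectable AUC with a very low TPR at 1% FPR: the strict operating point only credits fakes that outscore nearly every real video.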
sec 05

Key findings

Compression as adversary

Platform encoding erases many frequency cues detectors are trained to read.

Deployment-grade compression removes fragile high-frequency residuals, exposing detectors that perform well only under clean laboratory conditions.
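The effect described above can be illustrated on a toy signal. The sketch below treats one scanline of a frame as a list of intensities, uses a box blur as a crude stand-in for the low-pass effect of lossy re-encoding (an assumption: it is not a real codec), and measures high-frequency content as the mean squared difference between neighboring samples. All names are hypothetical.

```python
def box_blur(signal, radius=1):
    """Average each sample with its neighbors; a rough low-pass filter."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def high_freq_energy(signal):
    """Mean squared difference between adjacent samples."""
    diffs = [b - a for a, b in zip(signal, signal[1:])]
    return sum(d * d for d in diffs) / len(diffs)


# A smooth ramp (scene content) plus a fine alternating residual
# standing in for a fragile high-frequency cue.
scanline = [float(i) + (2.0 if i % 2 else -2.0) for i in range(32)]
smoothed = box_blur(scanline)

before = high_freq_energy(scanline)
after = high_freq_energy(smoothed)
retained = after / before  # fraction of high-frequency energy that survives
```

In this toy case the ramp survives the blur while most of the alternating residual does not, which mirrors the failure mode for detectors that rely on such residuals.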

Generator coverage

Coverage of frontier generators, not data volume alone, is the binding constraint.

Detector behavior varies substantially from one generator to another, suggesting that future benchmarks need deliberate coverage of modern synthesis pipelines.

Threshold calibration

Default thresholds are unstable across realistic deployment shifts.

Detector operating points can drift toward always-real or always-fake behavior when clean evaluation assumptions are removed.
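One way to make this drift concrete is to flag operating points where nearly every video receives the same label. The sketch below applies such a check to the TPR (fake) and TNR (real) values from the detector table. The `collapse_mode` helper and its margin of 0.1 are illustrative assumptions, and the check presumes the reported rates reflect each detector's default threshold.

```python
def collapse_mode(tpr_fake, tnr_real, margin=0.1):
    """Label an operating point as collapsed when one class is almost never predicted."""
    if tpr_fake <= margin and tnr_real >= 1.0 - margin:
        return "always-real"   # fakes are almost never flagged
    if tnr_real <= margin and tpr_fake >= 1.0 - margin:
        return "always-fake"   # reals are almost never cleared
    return "mixed"


# (TPR fake, TNR real) pairs copied from the detector table.
reported = {
    "D3": (0.748, 0.537),
    "WaveRep": (0.438, 0.945),
    "GenD": (0.060, 0.955),
    "VideoFACT": (0.562, 0.561),
    "FreqNet": (0.285, 0.679),
    "AIGVDet": (0.035, 0.956),
    "RINE": (0.042, 0.958),
    "UFD": (0.113, 0.903),
}

modes = {name: collapse_mode(tpr, tnr) for name, (tpr, tnr) in reported.items()}
collapsed = sorted(name for name, mode in modes.items() if mode != "mixed")
```

With this margin, GenD, AIGVDet, and RINE are flagged as always-real. UFD falls just outside the margin, which shows how sensitive the label is to that arbitrary choice.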

Optical decoys

Camera-style artifacts can look like generative artifacts.

Rolling shutter, chromatic aberration, generated high-ISO noise, and autofocus hunting test whether models distinguish capture physics from synthesis traces.
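As an example of capture physics that involves no synthesis, rolling shutter can be approximated by shifting each pixel row a little further than the one above it, since rows are read out at slightly different times while the scene moves. The sketch below is a simplified model under that assumption. The function name, the `pixels_per_row` parameter, and the edge-padding rule are hypothetical and do not describe how the RPG clips were produced.

```python
def rolling_shutter_skew(frame, pixels_per_row):
    """Shift row r right by round(r * pixels_per_row), padding with the edge pixel."""
    if pixels_per_row < 0:
        raise ValueError("pixels_per_row must be non-negative")
    skewed = []
    for r, row in enumerate(frame):
        shift = min(len(row), int(round(r * pixels_per_row)))
        # Later rows are read out later, so they see the scene displaced further.
        skewed.append([row[0]] * shift + row[:len(row) - shift])
    return skewed


# Three identical rows become a diagonal skew after the transform.
frame = [[0, 1, 2, 3] for _ in range(3)]
skewed = rolling_shutter_skew(frame, 1.0)
```

A real frame that passes through this transform is still real, so a detector that raises its fake score on the skewed copy is reacting to the capture artifact and not to a synthesis trace.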
