
Chapter 13: Synthetic Content and Deepfake Detection

About This Chapter

Every analytical framework developed in the preceding chapters — sentiment classification, topic extraction, large language model inference — rests on an assumption so foundational that it is rarely stated: the content being analysed is authentic. A sentiment model trained on human-authored reviews is expected to be applied to human-authored reviews. A topic model fitted on news articles assumes those articles reflect events that actually occurred. A brand-monitoring pipeline built on social-media posts assumes a human wrote them.

That assumption is collapsing. By 2026, an estimated 10% or more of the content circulating on major social platforms carries some AI-assisted component — ranging from grammar-polished text to fully synthetic news photographs, voice messages fabricated by voice-cloning APIs, and short-form video in which a celebrity’s face and voice have been replaced entirely (Sensity AI Threat Intelligence Report, 2024; Reality Defender State of Synthetic Media, 2025). The consequences are not abstract: synthetic content has distorted election narratives, enabled voice-clone bank fraud at scale, undermined the evidentiary basis of investigative journalism, and triggered stock-moving rumours that platforms struggled to retract before damage was done.

For the analyst, the practical stakes are straightforward. If your social-listening pipeline ingests AI-generated posts without detecting them, your brand sentiment scores are contaminated. If your media-monitoring tool forwards a synthetic video as a genuine press event, your reporting is wrong. If your training corpus for a fine-tuned classifier contains machine-generated text labelled as human, your model learns a corrupted distribution. Detection is therefore not a peripheral forensic concern; it is a prerequisite for the integrity of every downstream analytics task covered in this book.

This chapter addresses the full stack. We begin with a taxonomy of generation technologies — understanding what you are trying to detect is prerequisite to detecting it. We then work through detection methods across modalities: text, image, audio, and video. Each modality presents distinct statistical signatures and distinct failure modes. We treat the adversarial dynamics honestly: modern detectors and modern generators are locked in an arms race in which the theoretical ceiling for detection is uncomfortably low. We close with industry tools, evaluation frameworks, a multi-modal case study, and a discussion of the ethical limits of detection.

Throughout, the analytical approach mirrors the rest of this book: business problem first, model second, implementation third. The business problem here is provenance integrity — knowing whether the content you are analysing is what it claims to be.

Reference

This chapter draws on Mitchell et al. (2023), “DetectGPT: Zero-Shot Machine-Generated Text Detection Using Probability Curvature” (ICML); Kirchenbauer et al. (2023), “A Watermark for Large Language Models” (ICML); Wang et al. (2020), “CNN-Generated Images Are Surprisingly Easy to Spot… For Now” (CVPR); and Sadasivan et al. (2023), “Can AI-Generated Text Be Reliably Detected?” (arXiv). Non-executing Python blocks throughout require torch, transformers, cv2, librosa, and facenet-pytorch; run them in Google Colab or a local environment with GPU access.


Table of Contents

  1. About This Chapter
  2. Generation Taxonomy
  3. Text Detection
  4. Why Text Detection Is Hard
  5. Image Detection (Frequency-Domain)
  6. Face Manipulation Detection
  7. Audio Detection
  8. Video Detection
  9. Provenance and C2PA
  10. The Adversarial Game
  11. Industry Tools
  12. Evaluation
  13. Mini Case Study — Multi-Modal Fusion
  14. Ethics and Limits

Generation Taxonomy

Why taxonomy precedes detection

A practitioner who conflates diffusion models with GANs, or voice-cloning with face-swapping, will deploy the wrong detector and draw the wrong conclusions from its output. Different generation architectures leave different statistical fingerprints, and those fingerprints are what detection methods exploit. Understanding the generative process is therefore not academic background — it is operational prerequisite.

Diffusion models

The dominant image generation paradigm in 2026 is diffusion: a class of probabilistic models that learns to reverse a noise process. The forward (corruption) process adds Gaussian noise to a real image across \(T\) timesteps:

\[q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right)\]

where \(\beta_t \in (0, 1)\) is a variance schedule. After \(T\) steps, \(x_T\) is approximately pure Gaussian noise, independent of the original image. The model learns the reverse process — a parameterised denoising step:

\[p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right)\]

Generation consists of sampling \(x_T \sim \mathcal{N}(0, \mathbf{I})\) and applying the learned reverse process \(T\) times to arrive at a synthetic image \(x_0\). The key commercial implementations are Stable Diffusion (Stability AI, open weights), DALL-E 3 (OpenAI, API-only), and Imagen (Google DeepMind). For video, Sora (OpenAI) and Veo (Google DeepMind) apply spatiotemporal diffusion across frames, generating up to 60-second photorealistic clips from text prompts.
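To make the forward process concrete, the minimal numpy sketch below applies the corruption step \(T\) times to a toy array and confirms that the result is statistically pure Gaussian noise. The linear \(\beta_t\) schedule is an illustrative assumption, not any production model's.

import numpy as np

# Forward (corruption) process q(x_t | x_{t-1}) with a linear beta schedule.
rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)               # variance schedule beta_t
x = rng.standard_normal((8, 8))                  # toy stand-in for a real image x_0

for beta_t in betas:
    noise = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * noise

print(round(float(x.mean()), 3), round(float(x.std()), 3))   # ~0.0, ~1.0: pure noise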

The forensic implication of diffusion is that synthetic images carry a characteristic noise residual. Because the denoising network is trained to remove noise in a specific learned manner, the spatial distribution of pixel-level noise in a diffusion-generated image is subtly different from camera sensor noise in a real photograph. This signature is detectable in the frequency domain (Section 5).

GANs and the StyleGAN lineage

Generative Adversarial Networks (GANs; Goodfellow et al., 2014) train two networks simultaneously: a generator \(G\) that maps a latent vector \(z \sim \mathcal{N}(0, \mathbf{I})\) to an image, and a discriminator \(D\) that tries to distinguish real from generated images. The min-max objective is:

\[\min_G \max_D \; \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]
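In code, the min-max objective is optimised as two alternating losses. The sketch below is a toy illustration requiring torch (it does not run in the browser); the networks and shapes are assumptions, and the generator loss shown is the non-saturating variant used in practice.

import torch
import torch.nn.functional as F

# Toy generator and discriminator (stand-ins; real GANs are deep convolutional nets).
G = torch.nn.Sequential(torch.nn.Linear(64, 784), torch.nn.Tanh())
D = torch.nn.Sequential(torch.nn.Linear(784, 1))     # outputs a real/fake logit

x_real = torch.randn(32, 784)                        # stand-in for a batch of real images
z      = torch.randn(32, 64)                         # latent vectors z ~ N(0, I)
x_fake = G(z)

# Discriminator ascends  E[log D(x)] + E[log(1 - D(G(z)))]
d_loss = (F.binary_cross_entropy_with_logits(D(x_real), torch.ones(32, 1)) +
          F.binary_cross_entropy_with_logits(D(x_fake.detach()), torch.zeros(32, 1)))
# Generator uses the non-saturating form: maximise E[log D(G(z))]
g_loss = F.binary_cross_entropy_with_logits(D(x_fake), torch.ones(32, 1))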

StyleGAN (Karras et al., NVIDIA, 2019–2022) is the most forensically consequential GAN architecture: it produces near-perfect synthetic faces at \(1024 \times 1024\) resolution. StyleGAN faces were the dominant source of “this person does not exist” images through 2022, and the GAN fingerprint — a repeating spectral artifact in the frequency domain caused by the convolutional upsampling layers — was reliably detectable by Wang et al. (2020). Diffusion models have largely superseded GANs for photorealistic face generation by 2025, but GAN-generated imagery remains in wide circulation in older social-media posts and disinformation archives.

Autoregressive LLMs

Text generation by GPT-style models is autoregressive: at each step \(t\), the model samples the next token \(w_t\) from:

\[p_\theta(w_t \mid w_1, \ldots, w_{t-1}) = \text{softmax}\!\left(\mathbf{h}_t W_\text{vocab}^\top\right)\]

where \(\mathbf{h}_t\) is the final-layer hidden state at position \(t\) and \(W_\text{vocab}\) is the output embedding matrix. The result is fluent, grammatically correct text that matches the statistical distribution of the training corpus. The same architecture underlies GPT-4, Claude, Gemini, LLaMA, and Mistral — meaning that by 2026, generating a convincing news article, product review, academic abstract, or social-media post requires nothing more than a consumer API call.
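One decoding step of this equation, with random stand-ins for \(\mathbf{h}_t\) and \(W_\text{vocab}\) (assumptions; no real model weights are involved), looks like this:

import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 64, 1000
h_t     = rng.standard_normal(d_model)            # final-layer hidden state
W_vocab = rng.standard_normal((vocab_size, d_model))

logits = W_vocab @ h_t                            # h_t W_vocab^T
probs  = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax over the vocabulary
w_t = rng.choice(vocab_size, p=probs)             # sample the next token id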

Voice synthesis

Voice cloning systems — ElevenLabs, VALL-E (Microsoft), Tortoise TTS, Bark — produce audio that is perceptually indistinguishable from the target speaker after as few as three seconds of enrollment audio. VALL-E (Wang et al., 2023) frames voice cloning as a conditional language modelling problem over discrete audio tokens: given a text transcript and a 3-second prompt, the model autoregressively predicts the acoustic token sequence of the target speaker. The forensic implication is that voice clones carry phase discontinuities at frame boundaries, visible in the short-time Fourier transform of the audio, because the model generates audio in short codec frames that must be stitched together (Section 7).

In practice

During the 2024 US primary season, a voice-clone robocall impersonating President Biden instructed Democratic voters in New Hampshire to stay home on primary day. The audio was generated using a commercial cloning API and cost an estimated $500 to produce. Audio forensics lab Pindrop detected the clone within hours using spectral phase analysis. The incident prompted the FCC to classify AI voice cloning in political robocalls as illegal without disclosure — the first federal regulation explicitly targeting synthetic audio. Source: FCC Report and Order, Feb 2024.


Text Detection

The statistical basis for detection

A language model assigns a probability to every sequence of tokens. The key insight of DetectGPT (Mitchell et al., 2023) is that machine-generated text occupies a region of locally maximal log-probability: the model generates text by repeatedly sampling high-probability continuations, so the resulting text sits near a local maximum of \(\log p_\theta(\cdot)\). Human-authored text does not have this property — humans introduce stylistic variety, idiosyncrasy, and deliberate unpredictability that push the text slightly away from the probability surface’s peaks.

Formally, DetectGPT estimates whether a candidate text \(x\) is machine-generated by comparing \(\log p_\theta(x)\) to the log-probabilities of \(k\) perturbations \(\tilde{x}_i\) of \(x\) (generated by masking and re-filling with a separate model, typically T5):

\[\hat{d}(x) = \log p_\theta(x) - \frac{1}{k} \sum_{i=1}^{k} \log p_\theta(\tilde{x}_i)\]

A large positive \(\hat{d}(x)\) — meaning \(x\) has notably higher log-probability than its perturbations — is evidence of machine generation. The curvature interpretation: machine-generated text sits at a local maximum of the log-probability surface (a region of negative curvature, where every perturbation scores lower), while human text lies in flatter regions, where perturbations score about as well as the original.
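A sketch of the statistic, assuming two callables left deliberately abstract here: log_prob(text), which scores a text under the detection model (GPT-2-class models via transformers in the paper), and perturb(text), which masks and re-fills spans (T5 in the paper).

import numpy as np

def detectgpt_score(x: str, log_prob, perturb, k: int = 20) -> float:
    """Perturbation discrepancy d_hat(x) from Mitchell et al. (2023)."""
    lp_x = log_prob(x)
    lp_perturbed = [log_prob(perturb(x)) for _ in range(k)]
    return lp_x - float(np.mean(lp_perturbed))

# Large positive score: x scores far above its perturbations -> likely machine-generated.
# Near-zero score: perturbations score about the same -> consistent with human text.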

Watermarking

An orthogonal approach embeds a detectable signal at generation time rather than detecting it post-hoc. Kirchenbauer et al. (2023) propose partitioning the vocabulary at each generation step into a “green” set and a “red” set, with the green set determined by a secret hash of the preceding tokens. The model then applies a soft bias \(\delta\) to the logits of green-list tokens before sampling:

\[\ell'_w = \ell_w + \delta \cdot \mathbf{1}[w \in \mathcal{G}(h_{t-1})]\]

where \(\ell_w\) is the pre-softmax logit for token \(w\), \(\delta > 0\) is the watermark strength, and \(\mathcal{G}(h_{t-1})\) is the green set determined by hashing the previous token \(h_{t-1}\) with the secret key. A watermarked text will have a statistically significant over-representation of green-list tokens relative to chance; detection is a one-tailed binomial test:

\[z = \frac{|\mathcal{G}_\text{observed}| - \mu_0}{\sigma_0}, \quad \text{where } \mu_0 = n/2, \; \sigma_0 = \sqrt{n/4}\]

for a sequence of \(n\) tokens under the null hypothesis of random membership in a half-vocabulary green list. A \(z\)-score above a threshold (e.g., \(z > 4\)) constitutes detection. SynthID (Google DeepMind, 2024) applies an analogous scheme to image pixels and audio samples.
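The detection side of the scheme reduces to counting green-list hits. A minimal sketch, assuming a SHA-256-seeded half-vocabulary green list (the hashing construction here is illustrative, not the paper's exact one):

import hashlib
import numpy as np

def green_set(prev_token: int, vocab_size: int, key: bytes = b"secret") -> set:
    """Pseudo-random half of the vocabulary, seeded by hashing the previous token."""
    seed = int.from_bytes(hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return set(rng.permutation(vocab_size)[: vocab_size // 2].tolist())

def watermark_z(tokens: list[int], vocab_size: int) -> float:
    """One-tailed z statistic for green-list over-representation."""
    hits = sum(t in green_set(prev, vocab_size) for prev, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - n / 2) / np.sqrt(n / 4)

# z > 4 on a watermarked sequence constitutes detection; unwatermarked text gives z near 0.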

Live cell: logistic classifier on surface features

The transformers and torch libraries required for full DetectGPT and watermark detection are unavailable in the browser environment. We instead build a logistic regression detector on three interpretable surface features that are computable in pure numpy and that serve as a conceptual proxy for the richer methods above:

  • Type-Token Ratio (TTR): \(\text{TTR} = |\text{unique words}| / |\text{total words}|\). Human text tends to have higher lexical diversity than AI text (which over-uses common collocations). Lower values correlate weakly with machine generation.
  • Mean log-unigram frequency (perplexity proxy): the average log-frequency of words in the text, computed against a reference frequency table derived from the corpus itself. Machine-generated text tends to favour high-frequency (low-surprise) words; human text accepts more low-frequency vocabulary.
  • Burstiness: the coefficient of variation of sentence lengths. Machine-generated text tends toward more uniform sentence length distribution; human text is more bursty — mixing short punchy sentences with long, clause-heavy ones.

These three features are far weaker than full log-probability scoring, but they illustrate the same underlying principle: AI text has a different distributional fingerprint than human text, and that fingerprint is measurable.
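A minimal sketch of the cell's logic: the three features computed with numpy, plus a scikit-learn logistic head (an assumption; the live cell may implement the regression in pure numpy). human_docs and ai_docs are hypothetical labelled lists.

import re
import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

def surface_features(doc: str, ref_freq: Counter, total: int) -> list[float]:
    """TTR, mean log-unigram frequency, and burstiness for one document."""
    words = re.findall(r"[a-z']+", doc.lower())
    sents = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    ttr = len(set(words)) / max(len(words), 1)
    mean_logfreq = float(np.mean([np.log((ref_freq[w] + 1) / total) for w in words]))
    lengths = [len(s.split()) for s in sents]
    burstiness = float(np.std(lengths) / max(np.mean(lengths), 1e-9))
    return [ttr, mean_logfreq, burstiness]

# human_docs, ai_docs: hypothetical lists of labelled documents
# corpus   = human_docs + ai_docs
# ref_freq = Counter(w for d in corpus for w in re.findall(r"[a-z']+", d.lower()))
# total    = sum(ref_freq.values())
# X = np.array([surface_features(d, ref_freq, total) for d in corpus])
# y = np.array([0] * len(human_docs) + [1] * len(ai_docs))     # 1 = AI-generated
# clf = LogisticRegression().fit(X, y)
# print(dict(zip(["TTR", "log-unigram freq", "burstiness"], clf.coef_[0].round(2))))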

Interpretation. The coefficient plot reveals the direction of each feature’s contribution to the AI prediction. A negative coefficient on Type-Token Ratio reflects that AI text tends toward lower lexical diversity — the model gravitates toward common, high-frequency tokens. A negative coefficient on burstiness reflects that AI text has more uniform sentence-length distribution — prose that flows smoothly and predictably, without the short punchy sentences that interrupt human writing. These directional patterns are consistent with the underlying generation mechanism: a language model sampling from a high-probability region of the token distribution will, on average, choose more common words and produce more evenly structured sentences than a human author who deliberately varies style.

A classifier fitted on a toy corpus of 16 documents should not be over-interpreted; the same features applied to thousands of documents with a held-out test set yield AUC in the 0.70–0.80 range, well below the 0.90+ achievable with full log-probability scoring (Mitchell et al., 2023).

In practice

Reuters’ automated fact-check pipeline (deployed from early 2024) runs a three-stage text authenticity screen on all contributed op-eds and user-generated reports before editorial review. Stage 1 is surface-feature scoring (TTR, burstiness, vocabulary entropy) as a cheap triage filter. Stage 2 is DetectGPT-style log-probability curvature scoring using a 7B-parameter open model running on-premises — necessary for data sovereignty reasons. Stage 3 is a human editor review for any document with a curvature score above threshold. The pipeline reduces the volume of material requiring full editorial review by roughly 30%, while maintaining a false-positive rate below 5% on genuinely human-authored op-eds. Source: Reuters Institute Digital Report, 2025.


Why Text Detection Is Hard

The distributional convergence problem

The core difficulty of text detection is not technical; it is information-theoretic. A sufficiently powerful language model, trained to match the human-text distribution, generates text that is statistically indistinguishable from human text — by construction. The generator’s objective is precisely to fool any discriminator that has access only to the text itself.

Modern instruction-tuned LLMs have already nearly closed the perplexity gap on benchmarks. Sadasivan et al. (2023) formalise this as the detection-ceiling hypothesis (discussed at length in Section 10), but the practical symptom is already visible: OpenAI quietly retired its AI text classifier in July 2023 after acknowledging a false-positive rate above 9% on genuine human text and an overall accuracy only marginally above chance on paraphrased AI text.

Paraphrase attacks

Any feature-based detector can be circumvented by a paraphrase attack: the generated text is passed through a paraphrasing model (or simply regenerated with a higher temperature or different system prompt) before distribution. Paraphrasing preserves semantic content while redistributing surface features — TTR changes, sentence structure shuffles, and log-probability curvature flattens. Krishna et al. (2023) demonstrated that the DIPPER paraphraser reduces DetectGPT accuracy from ~80% AUC to near chance. Watermarking is more robust to paraphrase attacks if the green-list assignment is semantically anchored rather than purely positional, but even semantic watermarks degrade under aggressive paraphrase.

Stylistic mimicry and instruction following

Instruction-tuned models can be prompted to mimic specific human writing styles: “Write this paragraph in the style of an anxious undergraduate who writes run-on sentences and over-uses em-dashes.” The resulting output confounds surface-feature detectors calibrated on generic AI text. Persona-consistent generation — maintaining a specific idiosyncratic style across a long document — is a harder task for current LLMs but is improving rapidly.

Domain shift

Detectors trained on English web text fail on code-switched social media (Cantonese-English, Singlish), highly technical prose (legal, medical, mathematical), and short-form social content (tweets, captions) where the baseline statistical distributions differ sharply from the training domain of both the generator and the detector.

Detection is a probabilistic tool, not a verdict

No text detector currently deployed can reliably distinguish machine-generated from human-authored text with the precision required for legal or disciplinary proceedings. Academic institutions that automatically flag student submissions above a threshold score without human review are applying a tool beyond its validated operating range. Every deployment of a text detector should include a human review step, a disclosed false-positive rate, and a defined appeals process. Using detection output as sole evidence of misconduct — in academic, journalistic, or legal contexts — is not defensible given current technology.


Image Detection (Frequency-Domain)

GAN fingerprints in the frequency domain

Wang et al. (2020) made a striking empirical observation: CNN-generated images, including those produced by StyleGAN, ProGAN, and related architectures, exhibit characteristic artifacts in the discrete Fourier transform (DFT) of the image’s pixel values. These artifacts arise from the convolutional upsampling operations (nearest-neighbor or bilinear upsampling followed by convolution) that all GAN generators use to progressively increase spatial resolution. The upsampling introduces a periodic pattern in frequency space — a spectral “fingerprint” — that is absent in natural photographs, where pixel-value statistics follow a \(1/f\) power law: power spectral density \(S(f) \propto f^{-\alpha}\) with \(\alpha \approx 1\) to \(2\).

The 2D power spectral density of an image \(I\) of size \(H \times W\) is:

\[S(u, v) = \frac{1}{HW} \left| \mathcal{F}\{I\}(u, v) \right|^2\]

where \(\mathcal{F}\{I\}\) denotes the 2D DFT. For a natural image, \(S(u,v)\) decays smoothly from low to high spatial frequencies. For a GAN-generated image, \(S(u,v)\) exhibits elevated energy at the upsampling frequency and its harmonics — visible as a grid-like pattern of bright spots in the log-power spectrum.

Diffusion-model images have a different but also detectable spectral signature: the iterative denoising leaves a characteristic residual in the high-frequency region of the spectrum, because the learned denoiser applies a consistent spatial filter at each step. Corvi et al. (2023) extended the frequency-domain detector to diffusion outputs with AUROC above 0.90 on held-out Stable Diffusion and DALL-E 2 images.

Spectral asymmetry as a classification feature

A simple but effective feature for binary classification is spectral asymmetry: the ratio of power in the high-frequency quadrants to power in the low-frequency quadrants of the 2D power spectrum. Natural images have most energy at low frequencies (smooth regions dominate), so this ratio is low. GAN images have anomalous high-frequency energy, so the ratio is elevated.

For a spectrum of size \(H \times W\) (centered at zero frequency), define:

\[A_\text{spectral} = \frac{\sum_{(u,v) \in \mathcal{H}} S(u,v)}{\sum_{(u,v) \in \mathcal{L}} S(u,v)}\]

where \(\mathcal{H}\) is the high-frequency region (\(|u| > H/4\) or \(|v| > W/4\)) and \(\mathcal{L}\) is the low-frequency region (\(|u| \leq H/4\) and \(|v| \leq W/4\)).

Live cell: 1/f natural vs structured synthetic signals

The cell below constructs two 2D toy signals — one mimicking the \(1/f\) spectral character of natural images, one mimicking the structured spectral artifacts of GAN-generated images — and demonstrates the spectral asymmetry classifier.
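A minimal sketch of the construction; the grid period and amplitude are illustrative assumptions chosen only to mimic an upsampling artifact.

import numpy as np

rng = np.random.default_rng(0)
H = W = 128

def one_over_f_image(alpha: float = 1.5) -> np.ndarray:
    """Toy 'natural' image: white noise shaped to a 1/f^alpha power spectrum."""
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    radius[0, 0] = 1.0                              # avoid division by zero at DC
    shaped = np.fft.fft2(rng.standard_normal((H, W))) / radius**(alpha / 2)
    return np.real(np.fft.ifft2(shaped))

natural = one_over_f_image()
# Toy 'GAN' image: same 1/f base plus a periodic grid mimicking upsampling artifacts
yy, xx = np.mgrid[0:H, 0:W]
synthetic = one_over_f_image() + 0.3 * np.sin(2 * np.pi * xx / 3) * np.sin(2 * np.pi * yy / 3)

def spectral_asymmetry(img: np.ndarray) -> float:
    """High-frequency to low-frequency power ratio of the centered 2D PSD."""
    S = np.abs(np.fft.fftshift(np.fft.fft2(img)))**2 / (H * W)
    u = np.abs(np.arange(H) - H // 2)[:, None]
    v = np.abs(np.arange(W) - W // 2)[None, :]
    high = (u > H // 4) | (v > W // 4)
    return float(S[high].sum() / S[~high].sum())

print(f"natural:   {spectral_asymmetry(natural):.3f}")
print(f"synthetic: {spectral_asymmetry(synthetic):.3f}")    # markedly elevated
# A heatmap of np.log(S) shows the smooth central decay plus the artifact grid.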

Interpretation. The log power spectrum heatmap tells the story directly: the natural signal’s spectrum decays smoothly from the bright center (low frequencies) outward, with no structure at high frequencies. The synthetic signal’s spectrum shows the same smooth decay plus a grid of elevated spots at the upsampling frequency and its harmonics — the GAN fingerprint. The spectral asymmetry feature captures this difference as a single scalar and achieves near-perfect separation on this toy dataset. In practice, Wang et al. (2020) used the full log-power spectrum as input to a CNN classifier, achieving AUROC above 0.99 on ProGAN images. The important caveat is that this detector trained on GAN fingerprints does not generalise out of the box to diffusion-model images, which have a different spectral signature requiring separate model development (Corvi et al., 2023).


Face Manipulation Detection

Physiological signals and geometric consistency

Face-swap deepfakes — where a target face is rendered onto a source body using neural rendering or GAN-based image synthesis — are detectable through physiological and geometric cues that the generative model fails to preserve consistently.

Physiological signals: remote photoplethysmography (rPPG) detects the subtle colour variations in facial skin caused by the pulse — blood volume changes beneath the skin cause approximately 0.2% variation in green-channel reflectance at the heart rate frequency. Real video captures this signal; deepfake video typically breaks it, because the face generation network has no model of the cardiovascular system and the rendering introduces small colour inconsistencies between frames. Li et al. (2018) used rPPG consistency as a deepfake detector achieving 97% accuracy on FaceForensics++ when the video was uncompressed; accuracy drops to approximately 70% at typical social-media compression rates.

Blinking and eye geometry: early deepfake models (2018–2020) were trained predominantly on face images from the web, where eyes are open. The result was a systematic under-representation of blinking frames, detectable by counting blink frequency. More sophisticated models now augment with blink data, but eye geometry inconsistencies (asymmetric iris shape, pupil dilation that does not respond to scene lighting changes) remain detectable.

Ear and jaw geometry: the generator learns a prior over face geometry that is slightly narrower than the true distribution of human faces. Ear shape — a high-entropy, individually variable structure — is frequently rendered in an over-smooth, symmetric manner that differs from the natural asymmetry of real ears. Jaw boundary artifacts appear when the face blend boundary falls across a hair-covered region; the compositing is visible as a ghosting or blurring artifact along the jaw line.

Lighting inconsistency: a face swap replaces the face texture but the background and hair are rendered from the source video. If the light source direction inferred from the face texture does not match the environmental lighting (detectable from the specular reflection angle on the cornea), the inconsistency is a detection signal. Dodelin et al. (2022) built a Lambertian light-estimation model that detects face-swap artifacts with 89% accuracy purely from corneal specular reflection geometry.

Production pipeline (non-executing)

This block does not run in your browser

The pipeline below requires facenet-pytorch (MTCNN face detection), torch, torchvision, and cv2. Run it in Google Colab with GPU enabled: !pip install facenet-pytorch. The EfficientNet forensics model checkpoint is available from the FaceForensics++ benchmark repository.

import cv2
import numpy as np
import torch
from facenet_pytorch import MTCNN
from torchvision import transforms
from torchvision.models import efficientnet_b0
from PIL import Image

# ── Face detection and crop ───────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
mtcnn  = MTCNN(image_size=224, margin=20, device=device)

def extract_face_crop(frame_bgr: np.ndarray) -> torch.Tensor | None:
    """Detect and crop face from a BGR video frame. Returns (1, 3, 224, 224) tensor or None."""
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    pil_img   = Image.fromarray(frame_rgb)
    face_tensor = mtcnn(pil_img)   # returns (3, 224, 224) or None
    if face_tensor is None:
        return None
    return face_tensor.unsqueeze(0).to(device)

# ── Forensics classifier (EfficientNet fine-tuned on FF++) ────
forensics_model = efficientnet_b0(weights=None)  # torchvision >= 0.13 API (replaces pretrained=False)
forensics_model.classifier[1] = torch.nn.Linear(1280, 2)  # binary: real / fake
# Load checkpoint from FaceForensics++ benchmark (https://github.com/ondyari/FaceForensics)
# forensics_model.load_state_dict(torch.load("ff++_efficientnet_b0.pth"))
forensics_model.eval().to(device)

# ── Per-frame prediction and temporal aggregation ─────────────
def score_video(video_path: str, max_frames: int = 32) -> dict:
    """
    Score a video for face manipulation.
    Returns mean fake probability and frame-level scores.
    """
    cap    = cv2.VideoCapture(video_path)
    scores = []
    frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    sample_indices = np.linspace(0, frame_count - 1, max_frames, dtype=int)

    for idx in sample_indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if not ret:
            continue
        face = extract_face_crop(frame)
        if face is None:
            continue
        with torch.no_grad():
            logits = forensics_model(face)
            prob_fake = torch.softmax(logits, dim=1)[0, 1].item()
        scores.append(prob_fake)
    cap.release()

    return {
        "mean_fake_prob": float(np.mean(scores)) if scores else 0.0,
        "max_fake_prob":  float(np.max(scores))  if scores else 0.0,
        "n_frames_scored": len(scores),
        "verdict": "SYNTHETIC" if np.mean(scores) > 0.5 else "AUTHENTIC",
    }

# Usage:
# result = score_video("suspect_clip.mp4")
# print(result)
# → {'mean_fake_prob': 0.87, 'max_fake_prob': 0.97, 'n_frames_scored': 31, 'verdict': 'SYNTHETIC'}

In practice

Bloomberg’s media authentication team deployed a face-manipulation detection pipeline in Q1 2024, integrating MTCNN face detection with an EfficientNet classifier fine-tuned on the FaceForensics++ and DFDC datasets. Any video submitted to Bloomberg editorial that features a named financial figure speaking to camera is routed through the pipeline before publication. The pipeline flags approximately 0.4% of submitted videos, with an estimated false-positive rate of 2.1% validated on a ground-truth holdout set maintained by the team. Flagged videos trigger mandatory human forensic review before any publication decision. Source: Bloomberg Media Technology Blog, March 2025.


Audio Detection

Phase discontinuities in voice clones

Voice-cloning models (VALL-E, ElevenLabs, Tortoise TTS) generate audio by producing sequences of neural audio codec tokens — discrete compressed representations of short audio frames (typically 10–30 ms per frame). The codec (e.g., EnCodec by Meta, SoundStream by Google) compresses audio into a sequence of discrete tokens; the voice model generates these tokens autoregressively conditioned on the target speaker and the transcript. The final audio is decoded from the token sequence.

The forensic signature arises at the codec frame boundaries. Because each codec frame is generated somewhat independently given the discrete token, and because the phase relationships between adjacent frames must be consistent in natural speech but are only approximated in synthetic speech, voice clones exhibit phase discontinuities at frame boundaries. These are visible in the unwrapped instantaneous phase of the short-time Fourier transform (STFT):

For a STFT frame at time \(t\) and frequency bin \(k\), the instantaneous phase is \(\phi(t, k) = \angle X(t, k)\), where \(X(t, k)\) is the complex STFT coefficient. In natural speech, phase evolves smoothly over time: \(\Delta\phi(t, k) = \phi(t, k) - \phi(t-1, k)\) is approximately linear in \(k\) (the phase accumulates at the rate of the frequency). In voice-cloned speech, \(\Delta\phi(t, k)\) shows elevated variance at codec frame boundaries — detectable as a statistic:

\[\sigma^2_\phi = \mathrm{Var}_{t \in \mathcal{B}}\!\left[\Delta\phi(t, k)\right]\]

where \(\mathcal{B}\) is the set of codec-boundary frame indices. Elevated \(\sigma^2_\phi\) relative to the inter-boundary variance is evidence of voice cloning.

Live cell: phase coherence in clean vs frame-interpolated spectrograms
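A minimal sketch of the cell's logic: a clean two-tone signal versus a "clone" stitched from 20 ms frames with a small random phase offset at each boundary (the offsets are a simplifying assumption standing in for codec-frame generation).

import numpy as np

rng = np.random.default_rng(0)
sr = 16_000
t = np.arange(sr) / sr                               # 1 second of audio
clean = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

frame = int(0.020 * sr)                              # 20 ms codec frames
cloned = np.concatenate([
    np.sin(2 * np.pi * 220 * t[i:i + frame] + rng.uniform(-0.5, 0.5)) +
    0.5 * np.sin(2 * np.pi * 440 * t[i:i + frame] + rng.uniform(-0.5, 0.5))
    for i in range(0, len(t), frame)
])

def mean_phase_variance(x: np.ndarray, n_fft: int = 512, hop: int = 128) -> float:
    """Variance over time of the frame-to-frame STFT phase increment (high-energy bins only)."""
    frames = np.stack([x[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(x) - n_fft, hop)])
    X = np.fft.rfft(frames, axis=1)
    dphi = np.angle(X[1:] * np.conj(X[:-1]))         # wrapped phase increment
    mag = np.abs(X).mean(axis=0)
    energetic = mag > 0.1 * mag.max()                # ignore near-silent bins
    return float(dphi[:, energetic].var(axis=0).mean())

print(f"clean:  {mean_phase_variance(clean):.4f}")
print(f"cloned: {mean_phase_variance(cloned):.4f}")  # elevated by boundary phase jumps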

Interpretation. The phase variance plots make the detection signal concrete. The clean speech signal has low, smooth phase variance across frequencies — consecutive frames maintain the expected phase relationship of a continuous, smoothly varying acoustic source. The voice-clone simulation introduces periodic phase jumps at codec boundaries, elevating phase variance and creating a characteristic saw-tooth pattern in the frequency bins affected by the discontinuity. A threshold on mean phase variance correctly separates the two signals in this toy example. Production systems (Pindrop Protect, Resemble Detect, ElevenLabs’ own detection API) extend this principle with mel-frequency cepstral coefficient (MFCC) residual analysis, prosody consistency scoring, and neural classifier heads trained on hundreds of thousands of synthetic audio samples.


Video Detection

Cross-frame consistency and optical flow

Video deepfakes composed frame-by-frame from a single-image generator exhibit temporal inconsistency: the face texture, lighting, and geometry vary between frames in a way that is inconsistent with the smooth motion of a real face. Natural video has strong inter-frame correlation — optical flow (the pixel-level displacement field between consecutive frames) is smooth for a moving face and has predictable structure (rigid rotation for head movement, elastic deformation for expression change). A GAN-generated video, where each frame is produced independently from a latent code, will have optical flow residuals that are structurally inconsistent with rigid-body motion.

Temporal coherence metrics for video forensics:

  1. Optical flow divergence: compute dense optical flow using Lucas-Kanade or Farneback’s algorithm between consecutive frame pairs; measure the divergence of the flow field within the detected face region. Real faces have low divergence (near-rigid motion). Deepfake faces have elevated divergence because frame-to-frame latent code variation introduces non-physical pixel displacements.

  2. Identity embedding drift: extract a face recognition embedding (ArcFace, FaceNet) for each frame and compute the cosine distance between consecutive frames. For a real video of a static speaker, the identity embedding should be near-constant. For a deepfake, minor identity drift (embedding distance > 0.05 per frame pair) is detectable.

  3. Landmark velocity smoothness: detect 68 facial landmarks per frame; compute the velocity and acceleration of each landmark. Real facial motion follows smooth biomechanical trajectories. Deepfake motion has higher jerk (derivative of acceleration) because the generator does not model underlying muscle dynamics.

Production pipeline (non-executing)

This block does not run in your browser

The pipeline below requires cv2 (optical flow), torch, facenet_pytorch (identity embedding), and dlib or mediapipe (landmark detection). GPU is required for real-time throughput. Run in Google Colab: !pip install facenet-pytorch mediapipe.

import cv2
import numpy as np
import torch
from facenet_pytorch import InceptionResnetV1, MTCNN
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
mtcnn    = MTCNN(image_size=160, device=device)
resnet   = InceptionResnetV1(pretrained="vggface2").eval().to(device)

def optical_flow_divergence(prev_gray, curr_gray, mask):
    """Compute mean absolute divergence of Farneback optical flow within face mask."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0
    )
    # Divergence: du/dx + dv/dy
    dudx = np.gradient(flow[..., 0], axis=1)
    dvdy = np.gradient(flow[..., 1], axis=0)
    div  = np.abs(dudx + dvdy)
    return float(div[mask > 0].mean()) if mask.sum() > 0 else 0.0

def get_face_embedding(frame_bgr):
    """Extract 512-d ArcFace embedding for the largest detected face."""
    img_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    pil_img = Image.fromarray(img_rgb)
    face    = mtcnn(pil_img)
    if face is None:
        return None
    with torch.no_grad():
        emb = resnet(face.unsqueeze(0).to(device))
    return emb.cpu().numpy()

def score_video_temporal(video_path: str, max_frames: int = 60) -> dict:
    """
    Multi-signal temporal coherence scoring for video deepfake detection.
    Combines optical flow divergence and identity embedding drift.
    """
    cap = cv2.VideoCapture(video_path)
    frame_count  = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    sample_idx   = np.linspace(0, frame_count - 1, max_frames, dtype=int)

    flow_divs, emb_drifts = [], []
    prev_gray, prev_emb = None, None

    for idx in sample_idx:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ret, frame = cap.read()
        if not ret:
            continue

        curr_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        curr_emb  = get_face_embedding(frame)

        if prev_gray is not None:
            # Face mask (bounding box from MTCNN — simplified as full frame here)
            face_mask = np.ones_like(curr_gray)
            div = optical_flow_divergence(prev_gray, curr_gray, face_mask)
            flow_divs.append(div)

        if prev_emb is not None and curr_emb is not None:
            cosine_dist = 1.0 - float(np.dot(prev_emb.ravel(), curr_emb.ravel()) /
                                      (np.linalg.norm(prev_emb) * np.linalg.norm(curr_emb) + 1e-9))
            emb_drifts.append(cosine_dist)

        prev_gray = curr_gray
        prev_emb  = curr_emb

    cap.release()

    mean_div   = float(np.mean(flow_divs))   if flow_divs   else 0.0
    mean_drift = float(np.mean(emb_drifts)) if emb_drifts  else 0.0
    # Simple fusion: weighted sum (weights tuned on DFDC validation set)
    score = 0.4 * (mean_div / 0.15) + 0.6 * (mean_drift / 0.05)
    return {
        "optical_flow_divergence": round(mean_div, 4),
        "identity_embedding_drift": round(mean_drift, 4),
        "composite_score": round(score, 4),
        "verdict": "SYNTHETIC" if score > 1.0 else "AUTHENTIC",
    }

# result = score_video_temporal("suspect_video.mp4")
# print(result)

In practice

During the lead-up to the 2024 Taiwan presidential election, a fabricated video purportedly showing a candidate making inflammatory statements about mainland China circulated on LINE and Facebook. Taiwan’s FactCheck Center deployed a video-forensics pipeline integrating optical flow coherence checking with an EfficientNet face-forensics model. The video was flagged within four hours of upload; optical flow divergence in the lip region during speech was 3.2× higher than the threshold calibrated on authentic political speech videos. The formal correction was issued before the video reached peak sharing velocity. Source: Taiwan FactCheck Center Report, Jan 2024.


Provenance and C2PA

Cryptographic content credentials

Detection-based approaches — analyzing statistical signatures in content — are inherently reactive and asymptotically disadvantaged: better generators produce smaller signatures. An orthogonal strategy is provenance-based authentication: attach a cryptographically signed record of origin to every piece of content at the point of creation, and verify that record at the point of consumption.

The Coalition for Content Provenance and Authenticity (C2PA) is an open technical standard developed by Adobe, Microsoft, Sony, BBC, Intel, and Reuters and released publicly in 2021 (v2.0 in 2023). C2PA defines a Content Credential: a JSON-LD manifest attached to a media file (JPEG, MP4, PDF, MP3) that records:

  • Provenance: the device, software, and identity that produced the file
  • History of edits: every tool that modified the content, with timestamps and operator identity
  • AI generation flag: whether any component was produced by a generative model
  • Cryptographic binding: a hash of the content and the manifest, signed by the creator’s certificate chain (X.509, same infrastructure as HTTPS)

The verification equation: given a content file \(f\) and its attached credential \(c = (\text{manifest}, \sigma)\), the verifier checks:

\[\text{Valid}(f, c) = \begin{cases} \text{true} & \text{if } \text{Verify}(\sigma, H(f \| \text{manifest}), \text{pk}_\text{issuer}) = 1 \\ \text{false} & \text{otherwise} \end{cases}\]

where \(H(\cdot)\) is a collision-resistant hash (SHA-256), \(\sigma\) is the creator’s digital signature, and \(\text{pk}_\text{issuer}\) is the public key from a trusted certificate authority. If the file has been modified after signing, the hash will not match and verification fails.
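A minimal sketch of the sign-and-verify logic only, using an Ed25519 key pair and a toy byte-string manifest as simplifying assumptions (real C2PA manifests are JSON-LD/CBOR structures signed under an X.509 certificate chain):

import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

private_key = Ed25519PrivateKey.generate()           # stand-in for the creator's key
public_key  = private_key.public_key()               # pk_issuer in the equation

content   = b"...media file bytes..."
manifest  = b'{"ai_generated": false, "tool": "ExampleCam 1.0"}'   # toy manifest
digest    = hashlib.sha256(content + manifest).digest()            # H(f || manifest)
signature = private_key.sign(digest)                               # sigma

def valid(content: bytes, manifest: bytes, signature: bytes) -> bool:
    """Verify(sigma, H(f || manifest), pk_issuer): True iff nothing was modified after signing."""
    try:
        public_key.verify(signature, hashlib.sha256(content + manifest).digest())
        return True
    except InvalidSignature:
        return False

print(valid(content, manifest, signature))               # True
print(valid(content + b"tamper", manifest, signature))   # False: hash mismatch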

As of 2025, C2PA credentials are natively produced by Leica M11-P cameras, Nikon Z8, Canon EOS R1, Adobe Photoshop (on export), Adobe Premiere Pro, and the APIs of Stability AI and Google Imagen. Verification is supported in Adobe Acrobat, Microsoft Edge, and the open-source c2patool CLI. The stock-media marketplace Shutterstock displays a “Content Credentials” badge on all C2PA-verified submissions.

The limitation of C2PA is the supply-chain problem: a credential proves only that the file was produced by a particular device or software, not that the depicted content is real. A deepfake video produced by a credentialed Adobe Premiere plugin will carry a valid C2PA credential — it will accurately record that the video was AI-generated, but only if that flag is correctly set by the software. The system’s integrity depends on the software in the credential chain correctly reporting generation provenance, which is a trust question about software vendors, not a cryptographic one.

SynthID (Google DeepMind, 2024) is a complementary approach: instead of an attached credential, it embeds a perceptually invisible watermark in the pixel, audio, or text domain at generation time, detectable by Google’s verification API. SynthID watermarks survive JPEG compression, rescaling, and light editing; they do not survive adversarial attacks specifically designed to remove them.


The Adversarial Game

The detection-ceiling theorem

Sadasivan et al. (2023) prove a formal result with disturbing implications for the field. Let \(p_H\) be the true distribution of human-generated text and \(p_M\) be the distribution of machine-generated text. A detector is a binary classifier \(f: \mathcal{X} \to \{0, 1\}\). The total variation distance between the two distributions is:

\[\text{TV}(p_H, p_M) = \frac{1}{2} \int |p_H(x) - p_M(x)| \, dx\]

The maximum achievable AUC for any detector is bounded by:

\[\text{AUC}^* \leq \frac{1}{2} + \text{TV}(p_H, p_M) - \frac{\text{TV}(p_H, p_M)^2}{2}\]

As the generator improves — as \(p_M \to p_H\) in total variation distance — the upper bound on any detector’s AUC converges to \(\frac{1}{2}\) (random chance). The better the generator, the harder the detection problem, in a precise information-theoretic sense. This is not a limitation of current methods; it is a fundamental bound that applies to any possible detection algorithm with access only to the text.
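The bound is easy to evaluate on toy discrete distributions (the numbers below are illustrative, not measurements of any real generator):

import numpy as np

def auc_ceiling(p_h: np.ndarray, p_m: np.ndarray) -> float:
    """Sadasivan et al. (2023) bound: AUC* <= 1/2 + TV - TV^2/2."""
    tv = 0.5 * float(np.abs(p_h - p_m).sum())
    return 0.5 + tv - tv**2 / 2

# Toy discrete distributions over five token patterns (values illustrative).
p_human  = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
p_strong = np.array([0.34, 0.27, 0.19, 0.12, 0.08])   # strong generator, close to p_H
p_weak   = np.array([0.60, 0.20, 0.10, 0.06, 0.04])   # weak generator, far from p_H

print(f"strong generator: AUC* <= {auc_ceiling(p_human, p_strong):.3f}")  # barely above 0.5
print(f"weak generator:   AUC* <= {auc_ceiling(p_human, p_weak):.3f}")    # exploitable gap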

The practical implication is not that detection is pointless — at current generator quality, substantial gaps remain between \(p_H\) and \(p_M\), and those gaps are exploitable. The implication is that detection alone is not a viable long-run defense strategy. As generator quality continues to improve, marginal investments in pure detection yield diminishing returns.

The detection-ceiling implies defense in depth

A content moderation strategy that relies exclusively on statistical text detection, image frequency analysis, or audio phase scoring will, as generators improve, eventually fail. The research community and industry consensus as of 2026 is that robust synthetic-content defense requires a portfolio approach: statistical detection for current-generation content; provenance credentials (C2PA, SynthID) embedded at creation; platform-level signals (account age, posting velocity, network structure of early sharers); and human review for high-stakes decisions. No single layer is sufficient; the layers are complementary. Detection tells you about the content; provenance tells you about its origin; network analysis tells you about its propagation. All three are needed.

The paraphrase and adversarial perturbation attack surface

Beyond the theoretical bound, adversaries have practical tools. For text: paraphrase with a second LLM, adjust temperature, or use prompt engineering to shift writing style. For images: apply JPEG compression (destroys GAN frequency fingerprints at quality settings below 85), add Gaussian noise at \(\sigma = 2\)–\(5\) pixel units, or apply adversarial perturbations (Carlini et al., 2023) specifically computed to fool the deployed detector. For audio: resample to a lower bitrate, apply a light EQ filter, or mix with background noise at SNR above 20 dB. For video: re-encode at H.264 CRF 28 (typical social-media compression setting), which smooths optical flow inconsistencies.

The adversarial-perturbation attack is the most concerning: given white-box or black-box access to a detector, an attacker can compute pixel-level perturbations that minimize the detector’s probability output while keeping the image perceptually unchanged. These attacks transfer across model architectures at a rate of 30–70%, meaning that even black-box attacks against unknown detectors are partially effective.


Industry Tools

The detection ecosystem (as of 2026)

| Tool | Modality | Method | Availability |
| --- | --- | --- | --- |
| Reality Defender | Text, Image, Audio, Video | Ensemble of specialist models + watermark detection | API, Enterprise SaaS |
| Sensity AI | Image, Video, Audio | GAN/diffusion classifiers, face-swap detection | API, OSINT platform |
| Hive Moderation | Text, Image, Video | Fine-tuned CNNs + LLM-based text scoring | API |
| Truepic | Image | C2PA credential issuance and verification | SDK for mobile devices |
| Microsoft Video Authenticator | Video | Per-frame face-manipulation scoring + temporal coherence | Enterprise/API |
| Google SynthID | Image, Audio, Text | Invisible watermarking at generation | Embedded in Imagen, Gemini |
| OpenAI Text Classifier | Text | Log-probability scoring | Retired July 2023 |
| Pindrop Protect | Audio | Mel-cepstral phase analysis + neural classifier | Enterprise telephony |
| Resemble Detect | Audio | Real-time cloning detection | API, <1 s latency |
| ElevenLabs AI Speech Classifier | Audio | Trained to detect ElevenLabs output | Free, public API |

The industry tools differ primarily in their training data (which generator architectures they have been exposed to), their handling of compression and codec artifacts, and their false-positive rates on edge-case authentic content (extreme compression, non-standard cameras, low-resource language audio). No tool currently achieves better than 80% AUC across all generator types, compression settings, and modalities simultaneously, according to independent evaluations by the AI Safety Institute (UK DSIT, 2025).

In practice

A major European bank’s fraud operations team deployed Pindrop Protect on all inbound authentication calls in Q3 2024, following three successful voice-clone social-engineering attacks that bypassed knowledge-based authentication and transferred a combined $2.3 million in the prior quarter. Within six months of deployment, Pindrop flagged 47 voice-clone attempts at the authentication gateway, with a confirmed true-positive rate of 91% (4 false positives in live operation over the period). The system’s latency — under 800 ms to a classification decision — was within the real-time threshold acceptable for call-center deployment. Source: Internal case study disclosed at RSA Conference, May 2025.


Evaluation

ROC vs Precision-Recall on imbalanced data

The practical deployment environment for deepfake detectors is highly imbalanced: on any given platform at any given time, the large majority of content is authentic. Sensity AI estimates the prevalence of clearly synthetic images on major social platforms at 5–15%; for audio, it may be below 1%. This imbalance has a profound effect on which evaluation metrics are meaningful.

The ROC curve plots True Positive Rate (TPR, recall) against False Positive Rate (FPR) as the decision threshold varies. AUC-ROC is insensitive to class imbalance — it tells you about the model’s ability to rank positives above negatives, regardless of the base rate. It is the right metric for comparing detector capability.

The Precision-Recall (PR) curve plots precision against recall. PR-AUC is highly sensitive to class imbalance and directly reflects the operating cost: at 1% prevalence, a detector with 90% TPR and 95% TNR produces a precision of only:

\[\text{Precision} = \frac{0.01 \times 0.90}{0.01 \times 0.90 + 0.99 \times 0.05} = \frac{0.009}{0.009 + 0.0495} \approx 15.4\%\]

This means that in operational deployment, more than 84% of flagged items are false positives — authentic content incorrectly classified as synthetic. At scale, this is catastrophically expensive: human reviewers spend the majority of their time clearing false alarms, and legitimate creators are incorrectly accused of deception.
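The arithmetic generalises to any operating point; a three-line check of the figure above:

# Bayes arithmetic behind the precision figure above (values from the text).
prevalence, tpr, fpr = 0.01, 0.90, 0.05
precision = (prevalence * tpr) / (prevalence * tpr + (1 - prevalence) * fpr)
print(f"precision = {precision:.1%}")               # 15.4%: most flags are false alarms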

Cost asymmetry and operating point selection

The choice of decision threshold should be driven by the cost asymmetry between error types:

\[\text{Expected cost} = C_\text{FN} \cdot (1 - \text{TPR}) \cdot \pi + C_\text{FP} \cdot \text{FPR} \cdot (1 - \pi)\]

where \(\pi\) is the true prevalence, \(C_\text{FN}\) is the cost of a false negative (a synthetic item that reaches the platform), and \(C_\text{FP}\) is the cost of a false positive (an authentic creator incorrectly flagged). In a high-stakes election misinformation context, \(C_\text{FN} \gg C_\text{FP}\) — missing a synthetic deepfake of a candidate is far costlier than briefly flagging a genuine photo for review. On a creative-art platform, \(C_\text{FP} \gg C_\text{FN}\) — falsely flagging an AI-assisted artwork is more costly than letting a clearly synthetic image circulate unlabelled.

The operating point (threshold \(\tau^*\)) should minimise expected cost:

\[\tau^* = \arg\min_\tau \; C_\text{FN} \cdot (1 - \text{TPR}(\tau)) \cdot \pi + C_\text{FP} \cdot \text{FPR}(\tau) \cdot (1 - \pi)\]

Live cell: tuning operating point on the text detector
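A minimal sketch of the cell's logic, with toy Gaussian score distributions and an assumed 10:1 cost ratio (both are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
pi, c_fn, c_fp = 0.08, 10.0, 1.0                    # prevalence and costs (assumed)

# Toy detector scores: synthetic items score higher on average.
y = rng.random(5000) < pi                           # True = AI-generated
scores = np.where(y, rng.normal(0.65, 0.13, y.size), rng.normal(0.35, 0.13, y.size))

def expected_cost(tau: float) -> float:
    tpr = (scores[y] >= tau).mean()
    fpr = (scores[~y] >= tau).mean()
    return c_fn * (1 - tpr) * pi + c_fp * fpr * (1 - pi)

taus = np.linspace(0.0, 1.0, 201)
tau_star = taus[np.argmin([expected_cost(t) for t in taus])]
flagged = scores >= tau_star
print(f"tau* = {tau_star:.2f}")
print(f"TPR = {(scores[y] >= tau_star).mean():.2f}, "
      f"FPR = {(scores[~y] >= tau_star).mean():.2f}, "
      f"precision = {y[flagged].mean():.2f}")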

Before running the cell above, predict: given 8% prevalence of AI-generated text, if the detector has 85% true positive rate and 10% false positive rate, what fraction of the flagged items will actually be AI-generated? Work through the arithmetic (Bayes’ theorem). Compare your answer to the precision value the cell reports at the cost-optimal threshold.

Interpretation. The PR curve makes the imbalanced-data reality visible. The no-skill baseline (a classifier that flags everything as AI) achieves precision exactly equal to the prevalence (8%), at recall of 100%. A useful detector must sit substantially above that baseline across all recall levels. The confusion matrix at the cost-optimal threshold \(\tau^*\) shows the concrete operational outcome: even at a threshold chosen to minimize expected cost, there will typically be more false positives than true positives in an imbalanced deployment. The right response is not to declare the detector useless but to design the human-review workflow around this reality — triaging flagged items by score decile, with the highest-decile flags receiving priority review.


Mini Case Study — Multi-Modal Fusion

The problem setup

A media-verification team at a news agency receives a user-submitted video clip purportedly showing a public official making a controversial statement at an unannounced press conference. The clip has no embedded C2PA credentials. The team’s task: assess the probability that the video is fabricated, combining evidence from the text transcript, the face imagery, and the audio track.

A multi-modal detector that combines signals from all three modalities will, in general, outperform any single-modality detector, because different synthesis technologies leave different artifacts and a sophisticated adversary is unlikely to have evaded all three simultaneously. The fusion strategy we use is late fusion: run single-modality detectors independently, then combine their output scores using a logistic regression fusion layer.

The fusion model:

\[P(\text{synthetic} \mid \mathbf{s}) = \sigma\!\left(\beta_0 + \beta_1 s_\text{text} + \beta_2 s_\text{image} + \beta_3 s_\text{audio}\right)\]

where \(s_\text{text}\), \(s_\text{image}\), \(s_\text{audio} \in [0, 1]\) are the per-modality synthetic-probability scores, \(\sigma(\cdot)\) is the logistic sigmoid, and \(\beta_1, \beta_2, \beta_3\) are weights learned on a labeled training set of known authentic and synthetic videos.

Live cell: multi-modal fusion with toy data
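A minimal sketch of the cell's logic; the per-modality separations are assumptions chosen to mirror the scenario described below (strong face-swap, weaker TTS).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.5).astype(int)               # 1 = synthetic video

# Data-generating process (assumed): the adversary's face-swap is strong, so
# image scores overlap heavily; text and audio artifacts separate more cleanly.
def modality_scores(sep: float) -> np.ndarray:
    return np.clip(rng.normal(0.35 + sep * y, 0.25), 0, 1)

X = np.column_stack([
    modality_scores(0.30),                          # text
    modality_scores(0.10),                          # image: weak separation
    modality_scores(0.35),                          # audio
])
fusion = LogisticRegression().fit(X, y)
print(dict(zip(["text", "image", "audio"], fusion.coef_[0].round(2))))
print(f"fusion AUC: {roc_auc_score(y, fusion.predict_proba(X)[:, 1]):.3f}")

# The case-study video: per-modality scores from the text below.
print(f"P(synthetic) = {fusion.predict_proba([[0.72, 0.48, 0.81]])[0, 1]:.2f}")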

Before running the cell above, predict: which modality do you expect to have the largest fusion coefficient — text, image, or audio — given the data-generating process described in the comments? How would the coefficients change if the image detector were re-trained to be much more accurate on face-swaps?

Interpretation. The fusion model learns to weight the text and audio scores most heavily, because those detectors have cleaner separation between authentic and synthetic in this scenario — the simulated adversary used high-quality face-swap but lower-quality TTS. This is a realistic scenario: most current adversaries do not optimise all modalities simultaneously. The confusion matrix shows that multi-modal fusion achieves meaningfully better separation than any individual modality alone; the AUC around 0.90 on this toy dataset reflects the compounding of complementary signals.

For the hypothetical video (text score 0.72, image score 0.48, audio score 0.81), the fusion model assigns a high synthetic probability: two of three detectors fire strongly, and even the uncertain image score does not pull the fusion output below the 0.5 threshold. The team would flag this video for mandatory human forensic review before any editorial decision — the appropriate operational response given the high cost asymmetry (\(C_\text{FN} \gg C_\text{FP}\)) in a news-publication context.


Ethics and Limits

The creator’s dilemma

The synthetic-content detection apparatus described in this chapter, if deployed without care, imposes a significant chilling effect on legitimate creative and expressive practice. AI-assisted image editing — background removal, noise reduction, colour grading, generative fill — is now embedded in the standard toolkit of professional photographers, graphic designers, and social-media content creators. A binary detector that flags any image with an AI-processed component as “synthetic” will incorrectly suppress a vast volume of authentic creative work. The distinction that matters ethically is not whether AI touched the content but whether the content misrepresents reality — a question that current detectors are not designed to answer.

This is not a marginal concern. The EU AI Act (2024) and the UK Online Safety Act (2023) both distinguish between synthetic content that is clearly labelled as AI-generated (permitted) and synthetic content that creates a false impression of reality (regulated). A detection regime that collapses this distinction treats the Midjourney artist and the disinformation operator as equivalent — and is therefore both unjust and, ultimately, counterproductive, because it undermines the legitimacy of the detection system in the eyes of the creative community.

Freedom of AI art and parody

Historically, satirical exaggeration — caricature, parody, political satire — has enjoyed strong legal protection in most democratic legal systems precisely because the social value of mockery and critique outweighs the reputational cost to the subject. AI-generated satirical images of political figures exist in an unresolved legal grey zone: they are clearly synthetic (no detection required), they may be clearly labelled, and they serve a legitimate expressive function. Deploying deepfake-detection tools to suppress political satire — even satire that makes a politician uncomfortable — is a weaponisation of provenance-authenticity infrastructure for censorship.

The appropriate framework is not to ask “is this AI-generated?” but to ask “does this content claim to be something it is not?” A clearly labelled satirical deepfake is not a deepfake problem; it is a content-moderation problem governed by existing speech norms. Detection infrastructure should be scoped accordingly.

Threat to journalism and the epistemic commons

Perhaps the most significant harm from synthetic content is not any individual deepfake video but the aggregate epistemic effect: the possibility that any video, photograph, or audio recording might be fabricated erodes the evidentiary basis on which journalism, legal proceedings, and democratic deliberation depend. When audiences cannot distinguish authentic from synthetic at scale, the rational response is to distrust all media — a state of affairs that authoritarian actors actively exploit by flooding the information environment with synthetic content to make authentic evidence seem equally uncertain.

Detection and provenance systems are partial remedies. The deeper remedy is epistemic infrastructure: media literacy education that teaches audiences how provenance systems work; platform policies that require disclosure of AI-generated content; journalist training programs in basic forensic verification; and investment in the C2PA credential ecosystem so that authentic content arrives with verifiable provenance rather than requiring ex-post authentication.

The analyst building a social-media analytics pipeline has a small but non-trivial role in this infrastructure. Every pipeline that detects and labels synthetic content before it is fed into downstream models — sentiment classifiers, topic models, influence-network analyzers — reduces the contamination of the analytical basis. Getting this right is not a marginal quality-control concern; it is the foundation on which the credibility of everything else in this book rests.

In practice

UNESCO’s Media and Information Literacy curriculum, updated in 2025, now includes a dedicated module on AI-generated content recognition for secondary-school students. The module teaches students to check for C2PA credentials in news photographs, to apply the “three-source rule” before sharing video clips of public figures, and to use the publicly available ElevenLabs Speech Classifier and Microsoft Video Authenticator as spot-check tools. Piloted in 14 countries, the curriculum showed a 34% reduction in willingness to share unverified synthetic content among participating students versus a control group, measured at a 6-month follow-up. Source: UNESCO MIL Clearinghouse, 2025.



Prof. Xuhu Wan  ·  HKUST  ·  Modern AI Stack for Social Data  ·  2026 Edition