Chapter 5: Multimodal Analysis — Text, Image, and Video
About This Chapter
Open Instagram and count the posts that are pure text. The number approaches zero. Open TikTok: every piece of content is video, audio, and on-screen text intertwined. Open YouTube: a thumbnail image determines whether a viewer clicks before a single word is read. Modern social media is an inherently multimodal medium — images, video frames, audio tracks, captions, and overlaid text are fused into a single communicative act, and any analytics pipeline that processes only the text component is working with an amputated signal.
This matters for a wide range of practical tasks. A brand safety system that reads captions but cannot see the image will miss the photo of a competitor’s product embedded in an otherwise neutral post. A content moderation tool that transcribes audio but ignores video frames cannot detect graphic content. An influencer analytics platform that counts words but ignores thumbnail aesthetics misses the primary driver of click-through rate. The gap between what text-only models can see and what a human reviewer immediately perceives is large, and closing that gap is the central engineering challenge of contemporary social media analytics.
The solution the field has converged on — since the landmark CLIP paper from OpenAI in 2021 — is the joint multimodal embedding: a single learned vector space in which images, text, video frames, and audio can all be represented, compared, and retrieved across modalities. The intuition is powerful: if a photo of a sunlit beach and the sentence “a sunny day at the coast” are mapped to nearby points in the same high-dimensional space, then every downstream task — classification, retrieval, anomaly detection — can proceed using the same nearest-neighbour machinery that drives text-only semantic search.
This chapter builds the conceptual and practical foundation for multimodal analytics, working from first principles. We start with the pixel: what is an image, mathematically, and how do convolutional operations extract structure from it? We then move to the CNN as a learned feature extractor, to CLIP and joint vision-language embeddings, and to the generation-side models (BLIP, LLaVA, GPT-4V) that can caption, answer questions about, and reason over visual content. We cover video as a sequence-of-frames problem, audio as a time-series-to-spectrogram problem, and we close with two applied architectures: brand logo and face detection, and a full multimodal content-moderation pipeline of the type deployed by Meta and TikTok at production scale.
Where real computer vision libraries — PyTorch, torchvision, OpenCV’s full pipeline, the transformers vision modules — are required, the code appears in non-executing python blocks clearly marked as such. For all live browser cells, the demonstrations use pure numpy, scipy, and matplotlib to illustrate the underlying mathematics without requiring GPU infrastructure. The goal is conceptual fluency: after this chapter, you will be able to design a multimodal analytics pipeline, communicate it to an engineering team, interpret its outputs critically, and identify where it will fail.
The primary technical references for this chapter are Radford et al. (2021), “Learning Transferable Visual Models From Natural Language Supervision” (the CLIP paper); Li et al. (2022), “BLIP: Bootstrapping Language-Image Pre-training”; and Liu et al. (2023), “Visual Instruction Tuning” (LLaVA). For convolutional networks, the canonical pedagogical source remains LeCun et al. (1998) and Goodfellow, Bengio & Courville, Deep Learning (MIT Press, 2016), Chapters 9–11. All python blocks in this chapter require libraries that cannot run in the browser; copy them to Google Colab or a local Python environment to execute.
Table of Contents
- Images as Tensors
- Convolutional Kernels: Edge Detection and Blur from Scratch
- From Pixels to Features: CNN Intuition
- CLIP and Joint Vision-Language Embeddings
- Live Demo: Embedding-Based Retrieval with Random Vectors
- Vision-Language Models for Captioning and VQA
- Video as a Sequence of Frames
- Audio: From Waveform to Features
- Detecting Brand Logos and Faces
- Building a Multimodal Content-Moderation Pipeline
- Closing: The End-to-End Modern Analytics Stack
Images as Tensors
Pixels are numbers
Every digital image is, at its mathematical core, a rectangular array of numbers. A grayscale image of height \(H\) and width \(W\) is a matrix \(X \in \mathbb{R}^{H \times W}\), where each entry \(x_{i,j} \in [0, 255]\) is the pixel intensity at row \(i\), column \(j\) — 0 is black, 255 is white. A color image adds a third dimension for the three color channels (Red, Green, Blue), giving a tensor \(X \in \mathbb{R}^{H \times W \times 3}\). A batch of \(N\) color images as typically fed into a neural network has shape \((N, H, W, 3)\) in TensorFlow convention, or \((N, 3, H, W)\) in PyTorch convention.
This tensor representation has an immediate practical implication: every image operation is a matrix operation. Brightness adjustment is scalar multiplication. Blending two images is a weighted sum. Detecting edges is convolution with a specific kernel. And learning a feature detector — a convolutional filter — is an optimization problem over kernel weights, solvable by gradient descent on the same loss functions used to train language models. The mathematical machinery of deep learning applies uniformly across text, images, and audio, because all three reduce to tensors.
The cell below generates a small \(16 \times 16\) synthetic grayscale image, visualises it with matplotlib, and demonstrates how numpy operations directly manipulate pixel values.
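The following is a minimal sketch of what such a cell contains, using only numpy and matplotlib; the exact pixel values are illustrative stand-ins (a gradient background, a bright cross, and a small "logo" patch) chosen to match the interpretation that follows.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 16x16 grayscale image: gradient background, bright cross, "logo" patch
H = W = 16
img = np.tile(np.linspace(0, 80, W), (H, 1))   # background gradient, values 0-80
img[7:9, :] = 220                              # horizontal bar of the cross
img[:, 7:9] = 220                              # vertical bar of the cross
img[1:4, 12:15] = 255                          # small bright "logo" region

# Warm-tinted RGB version: scale the three channels independently from one source
rgb = np.stack([img * 1.0, img * 0.8, img * 0.55], axis=-1).clip(0, 255) / 255.0

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
axes[0].imshow(img, cmap="gray", vmin=0, vmax=255); axes[0].set_title("Grayscale")
axes[1].imshow(rgb);                                axes[1].set_title("Warm-tinted RGB")
axes[2].hist(img.ravel(), bins=32, range=(0, 255)); axes[2].set_title("Intensity histogram")
plt.tight_layout()
plt.show()

# Every image operation is array arithmetic:
brighter = np.clip(img * 1.3, 0, 255)          # brightness = scalar multiplication
blended  = 0.5 * img + 0.5 * brighter          # blending = weighted sum
print("Image tensor shape:", img.shape, "| RGB tensor shape:", rgb.shape)
```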
Interpretation. The left panel shows the structured \(16 \times 16\) grayscale image as rendered pixels. The middle panel shows a warm-tinted RGB version — constructed by independently scaling the three color channels from the same source matrix. The right panel shows the intensity histogram: most pixels cluster near the gradient background values (0–80), with a sharp spike at 220–255 from the cross and the logo region. In a real image analysis task, histogram shape is a useful first diagnostic: a bimodal histogram often indicates a foreground-background separation that simplifies segmentation; a nearly uniform histogram suggests a natural scene with rich detail.
Convolutional Kernels: Edge Detection and Blur from Scratch
What is convolution?
A convolutional kernel (also called a filter) is a small matrix — typically \(3 \times 3\) or \(5 \times 5\) — that slides across the image, replacing each pixel with a weighted sum of its neighbours. Formally, for a grayscale image \(X\) and kernel \(K\) of size \(k \times k\), the output at position \((i, j)\) is:
\[y_{i,j} = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} K_{u,v} \cdot X_{i+u,\, j+v}\]
This is discrete two-dimensional convolution (technically cross-correlation in the signal-processing convention, but the distinction does not matter for neural networks because the kernels are learned and can represent either). The kernel weights determine what structure the operation detects:
- A kernel whose rows sum to zero and whose pattern has opposite signs above and below a horizontal midpoint detects horizontal edges — transitions from dark to bright moving downward.
- A kernel with equal positive weights everywhere (a “box blur”) averages neighbouring pixels, smoothing out noise and fine detail.
- The Sobel operator combines horizontal and vertical edge detection to produce a gradient magnitude image.
In a Convolutional Neural Network (CNN), these kernels are not hand-crafted; they are learned parameters, optimised by gradient descent to detect the features that are most useful for the task. But understanding the hand-crafted case first gives direct intuition for what a trained CNN is doing at its lowest layers.
Before running, predict: the Sobel edge kernel has opposite signs in its left and right columns. Which pixels in our synthetic cross image will show the highest edge response — the interior of the cross bars, or the boundaries between the cross and the background?
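A minimal sketch of the convolution cell, assuming the same synthetic cross image as above. It applies a hand-crafted Sobel pair and a box blur with scipy.signal.convolve2d (true convolution flips the kernel, which does not affect the symmetric blur or the gradient magnitude).

```python
import numpy as np
from scipy.signal import convolve2d
import matplotlib.pyplot as plt

# Rebuild the synthetic cross image (same construction as the previous cell)
H = W = 16
img = np.tile(np.linspace(0, 80, W), (H, 1))
img[7:9, :] = 220
img[:, 7:9] = 220

# Hand-crafted kernels
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # vertical edges
sobel_y = sobel_x.T                                                     # horizontal edges
box_blur = np.ones((3, 3)) / 9.0                                        # local average

gx = convolve2d(img, sobel_x, mode="same", boundary="symm")
gy = convolve2d(img, sobel_y, mode="same", boundary="symm")
grad_mag = np.sqrt(gx**2 + gy**2)            # Sobel gradient magnitude
blurred = convolve2d(img, box_blur, mode="same", boundary="symm")

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (data, title) in zip(axes, [(img, "Original"),
                                    (grad_mag, "Sobel gradient magnitude"),
                                    (blurred, "3x3 box blur")]):
    ax.imshow(data, cmap="gray"); ax.set_title(title); ax.axis("off")
plt.tight_layout()
plt.show()
```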
Interpretation. The prediction is confirmed: the highest gradient magnitude responses appear at the boundaries between the bright cross bars and the darker background, not at the interior of the bars. The interior is uniform (constant pixel value 220), so the weighted-difference kernel produces near-zero output there. The boundary is a sharp transition, so the kernel produces a large positive or negative value. This is the fundamental reason why CNNs trained on image recognition learn edge detectors in their first layer — edges are the most information-dense structures in natural images, and detecting them early enables all subsequent feature extraction.
The blur filter does the opposite: it erases sharp transitions by averaging them with their neighbours. Blur corresponds to low-pass filtering in the frequency domain. In a CNN, the pooling operation (discussed in the next section) performs a related function: it reduces spatial resolution deliberately, forcing the model to extract higher-level structure rather than memorising pixel-exact patterns.
From Pixels to Features: CNN Intuition
Stacking layers to build abstraction
A single convolution followed by a nonlinearity detects edges and simple textures. A second convolution on top of that output detects combinations of edges — corners, curves, circles. A third layer detects combinations of those — eyes, wheels, logos. This is the central intuition of the Convolutional Neural Network (CNN): by stacking learned convolutions with nonlinearities and spatial pooling, the network builds a hierarchy of increasingly abstract feature detectors, moving from pixels to semantics.
Formally, a single convolutional layer with \(C_\text{out}\) output channels, kernel size \(k \times k\), and \(C_\text{in}\) input channels computes:
\[y_{i,j,c} = \sigma\!\left(\sum_{c'=1}^{C_\text{in}} \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} W_{u,v,c',c} \cdot x_{i+u,\,j+v,\,c'} + b_c\right)\]
where \(W \in \mathbb{R}^{k \times k \times C_\text{in} \times C_\text{out}}\) is the learned weight tensor, \(b_c\) is a bias term per output channel, and \(\sigma\) is a nonlinear activation function — typically ReLU (\(\sigma(z) = \max(0, z)\)). The ReLU activation is computationally trivial but essential: without a nonlinearity between layers, a stack of linear convolutions collapses to a single linear convolution, and the depth of the network provides no representational benefit.
Pooling and spatial hierarchy
After each convolutional layer, a pooling operation reduces spatial resolution. Max-pooling over a \(2 \times 2\) window with stride 2 replaces each non-overlapping \(2 \times 2\) block with its maximum value, halving both height and width. This achieves two goals: it reduces the number of parameters in subsequent layers (a feature map that is \(56 \times 56\) instead of \(224 \times 224\) has 16× fewer spatial locations), and it introduces translation invariance — if an edge shifts by one pixel, the max-pool output is unchanged, so the network’s higher layers need not memorise exact pixel positions.
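A tiny numpy illustration of the max-pooling arithmetic — a sketch of the operation itself, not a library implementation:

```python
import numpy as np

# 2x2 max-pooling with stride 2, via reshape (illustrative only)
def max_pool_2x2(x):
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]], dtype=float)
pooled = max_pool_2x2(feature_map)
print(pooled)          # [[4. 2.]
                       #  [2. 8.]]

# Translation tolerance: swap the 7 and 8 inside their 2x2 window —
# the pooled output is unchanged because the maximum stays in the same block
shifted = feature_map.copy()
shifted[3, 2], shifted[3, 3] = 8, 7
print(np.array_equal(max_pool_2x2(shifted), pooled))   # True
```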
The architectural timeline
The CNN architecture that made deep learning viable at scale was LeNet-5 (LeCun et al., 1998), a five-layer network used for handwritten digit recognition in postal sorting machines. LeNet established the conv-pool-conv-pool-fully-connected pattern that all subsequent architectures inherit.
AlexNet (Krizhevsky, Sutskever & Hinton, 2012) scaled LeNet to ImageNet — 1.2 million labeled images across 1,000 categories — and won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a 10-percentage-point margin. AlexNet used five convolutional layers, three fully connected layers, ReLU activations (replacing the sigmoid used by LeNet), and data augmentation. Its win in 2012 is the commonly cited start of the modern deep learning era.
ResNet (He et al., 2015) introduced residual connections — skip connections that add the input of a block directly to its output (\(y = F(x) + x\)) — solving the vanishing gradient problem that made networks deeper than ~20 layers fail to train. ResNet-152 won ILSVRC 2015 with a top-5 error of 3.57%, below the estimated human error rate of roughly 5% on the benchmark. ResNet architectures (ResNet-50, ResNet-101) remain the workhorses of applied computer vision.
Vision Transformer (Dosovitskiy et al., 2020) applied the transformer architecture from NLP to image classification. An image is divided into a grid of \(16 \times 16\) pixel patches; each patch is flattened and linearly projected to form a sequence of “visual tokens”; standard multi-head self-attention is applied over this sequence. ViT achieves state-of-the-art performance at large scale, and is the image encoder used in CLIP (Section 4).
Instagram visual search and brand monitoring. Instagram’s visual discovery infrastructure (deployed across Reels, Stories, and the Explore feed) uses ResNet and ViT-based embeddings to power visual similarity search. When a user saves an outfit photo, the system embeds the image and retrieves visually similar products from participating merchant catalogs — no text query required. For brand monitoring teams, this creates an asymmetry: a competitor’s product can appear in user posts without any caption text naming the brand, yet visual-search-based monitoring tools will detect it. Text-only social listening, which dominated brand monitoring until 2020, systematically undercounts visual brand exposures. A 2022 study by a major consumer goods company found that visual brand mentions exceeded text mentions by a factor of 2.3 on Instagram — an exposure volume that was entirely invisible to the text-only monitoring system.
CLIP and Joint Vision-Language Embeddings
The alignment problem
Before 2021, the dominant paradigm for image classification was supervised learning with fixed label sets: train a CNN on a dataset labelled with specific categories (dog, cat, airplane), and the model can classify images into those categories at deployment. This works well when the categories are known in advance and when labelled training data is abundant. Both conditions frequently fail in practice: a brand safety system must detect new types of violations as they emerge; a content moderation system must handle image categories that did not exist when the model was trained; a visual search system must retrieve from a catalogue that changes daily.
CLIP — Contrastive Language-Image Pre-training, introduced by Radford et al. (2021) at OpenAI — resolves this by training not a classification head but an alignment between image and text embedding spaces. The training data is 400 million (image, text) pairs scraped from the internet — images with their captions, alt-text, and surrounding text. Two encoders are trained simultaneously:
- An image encoder (a ResNet or Vision Transformer) that maps each image to a vector \(\mathbf{v}_I \in \mathbb{R}^d\).
- A text encoder (a transformer) that maps each caption to a vector \(\mathbf{v}_T \in \mathbb{R}^d\).
The training objective forces the embeddings of matching (image, caption) pairs to be close, while pushing non-matching pairs apart.
The contrastive loss
The specific loss function is InfoNCE (Noise-Contrastive Estimation), applied symmetrically over a batch of \(N\) image-text pairs. For a batch \(\{(I_i, T_i)\}_{i=1}^N\), define the cosine similarity matrix \(S \in \mathbb{R}^{N \times N}\) where \(S_{ij} = \mathbf{v}_{I_i}^\top \mathbf{v}_{T_j} / \tau\) and \(\tau\) is a learnable temperature parameter. The loss is:
\[\mathcal{L}_\text{CLIP} = -\frac{1}{2N} \sum_{i=1}^{N} \left[\log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ij})} + \log \frac{\exp(S_{ii})}{\sum_{j=1}^{N} \exp(S_{ji})}\right]\]
The first term treats each image as a query and the correct caption as the positive — a standard cross-entropy over the \(N\) text candidates. The second term does the symmetric thing with text as query. Minimising this loss forces the model to make each true pair \((I_i, T_i)\) maximally similar while treating all \(N-1\) other texts as negatives. At the scale CLIP was trained — 400M pairs, batch size up to 32,768 — this contrastive pressure is extraordinarily strong.
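A toy numpy sketch of the symmetric loss above, with random unit vectors standing in for the encoder outputs. It is useful for one sanity check: on random embeddings the loss sits near the chance value \(\log N\), and training drives it toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 4, 8                                    # batch of 4 pairs, 8-dim embeddings
img = rng.normal(size=(N, d))
txt = rng.normal(size=(N, d))
img /= np.linalg.norm(img, axis=1, keepdims=True)   # unit-normalise
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

tau = 0.07                                     # temperature (learnable in real CLIP)
S = (img @ txt.T) / tau                        # similarity matrix S_ij

def cross_entropy_diag(logits):
    # -log softmax of the diagonal entry, averaged over rows
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

loss = 0.5 * (cross_entropy_diag(S) + cross_entropy_diag(S.T))
print(f"Symmetric InfoNCE loss on random embeddings: {loss:.3f}")
print(f"Chance level log(N) = {np.log(N):.3f}")
```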
What CLIP enables
The aligned embedding space unlocks several capabilities with no additional training:
Zero-shot image classification. To classify an image into one of \(K\) categories, embed the image \(\mathbf{v}_I\) and embed each label as a text phrase (“a photo of a dog”, “a photo of a cat”), producing \(\mathbf{v}_{T_1}, \ldots, \mathbf{v}_{T_K}\). The predicted class is \(\arg\max_k \cos(\mathbf{v}_I, \mathbf{v}_{T_k})\). CLIP achieves 76.2% top-1 accuracy on ImageNet zero-shot — matching a ResNet-50 trained on 1.2M labeled ImageNet examples, without having seen a single labeled ImageNet image.
Image-text retrieval. Given a text query, retrieve the top-\(k\) images from a corpus by cosine similarity. Given an image query, retrieve the top-\(k\) matching captions or documents. This is cross-modal semantic search, the multimodal analogue of the dense retrieval covered in Chapter 3.
Semantic image comparison. Two images can be compared via their text descriptions rather than their pixels: embed both images, project to the text domain by finding nearest text neighbours, and compare conceptually.
Production code: CLIP zero-shot classification
The code below requires the clip package from OpenAI (and therefore torch and torchvision), which cannot be installed in the browser’s Python environment. Copy it to a Google Colab notebook with a GPU runtime (Runtime → Change runtime type → T4 GPU) and run !pip install git+https://github.com/openai/CLIP.git first. The free Colab T4 is sufficient for all CLIP inference examples.
import torch
import clip
from PIL import Image
import requests
from io import BytesIO
# ── Load model ───────────────────────────────────────────────────
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# ViT-B/32 is the smaller, faster CLIP variant; ViT-L/14 is higher quality
# ── Load an image (use any publicly accessible URL or local file) ─
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/Cute_dog.jpg/320px-Cute_dog.jpg"
response = requests.get(url)
image = preprocess(Image.open(BytesIO(response.content))).unsqueeze(0).to(device)
# ── Define candidate text labels ─────────────────────────────────
candidate_labels = [
"a photo of a dog",
"a photo of a cat",
"a photo of a car",
"a photo of a person",
"a photo of food",
]
text_tokens = clip.tokenize(candidate_labels).to(device)
# ── Forward pass ─────────────────────────────────────────────────
with torch.no_grad():
image_features = model.encode_image(image) # (1, 512)
text_features = model.encode_text(text_tokens) # (5, 512)
# Normalise to unit norm so dot product = cosine similarity
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# Similarity scores and softmax probabilities
logits = (image_features @ text_features.T) * model.logit_scale.exp()
probs = logits.softmax(dim=-1).cpu().numpy()[0]
# ── Print results ────────────────────────────────────────────────
print("Zero-shot classification results:")
for label, prob in sorted(zip(candidate_labels, probs), key=lambda x: -x[1]):
bar = '#' * int(prob * 40)
print(f" {prob:.3f} {bar} {label}")
# Expected output for a dog photo:
# 0.934 ######################################## a photo of a dog
# 0.042 ## a photo of a cat
# ...

The model classifies the image using only the cosine similarity between the image embedding and the text embeddings — no classification-specific training, no label-specific weights. Changing the candidate labels requires zero additional compute; the image embedding is computed once and then compared against any set of text queries in milliseconds.
Live Demo: Embedding-Based Retrieval with Random Vectors
The mechanism is linear algebra
Real CLIP embeddings cannot run in the browser, but the retrieval mechanism — nearest-neighbour search in cosine distance — is pure linear algebra. The cell below creates a toy corpus of five image captions and five text queries, assigns each a 32-dimensional random embedding, computes the full cross-modal cosine similarity matrix, and performs retrieval in both directions: image-to-text and text-to-image.
The embeddings are random, so the matches are arbitrary — the point of this demo is to make the matrix operations concrete and observable. Swap in real CLIP embeddings, and the same code produces semantically meaningful cross-modal retrieval.
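A minimal sketch of that cell; the caption and query strings below are illustrative placeholders, and the random vectors stand in for real encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 32

captions = ["sunset over a beach", "a plate of ramen", "a football match",
            "a phone unboxing video", "a cat wearing sunglasses"]
queries = ["sunny coastline", "noodle soup", "sports highlights",
           "new smartphone", "funny pet"]

# Stand-ins for CLIP outputs: random unit-norm embeddings
def embed(texts):
    v = rng.normal(size=(len(texts), dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

image_embs = embed(captions)       # pretend these came from the image encoder
query_embs = embed(queries)        # and these from the text encoder

# Cross-modal similarity matrix: one matrix multiply
sim = query_embs @ image_embs.T    # shape (5, 5), entries are cosine similarities

# Text -> image retrieval: argsort each row
for q, row in zip(queries, sim):
    best = np.argsort(-row)[:2]
    print(f"query '{q}':")
    for j in best:
        print(f"   {row[j]:+.3f}  {captions[j]}")

# Image -> text retrieval is the same operation on the transpose
best_query_per_image = np.argmax(sim.T, axis=1)
for c, qi in zip(captions, best_query_per_image):
    print(f"image '{c}'  ->  '{queries[qi]}'")
```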
Interpretation. With random embeddings, the highest-similarity match for any query is arbitrary. The takeaway is structural: the entire retrieval system is (1) a matrix multiply — query_embs @ image_embs.T — and (2) an argsort of the resulting similarity scores. There are no loops, no if-statements, no task-specific logic. When CLIP replaces the random embeddings, the matrix entries become semantically meaningful, and the argsort produces the cross-modal results that make visual search feel like magic. The engineering challenge is not in the retrieval code; it is in producing good embeddings at scale.
Vision-Language Models for Captioning and VQA
Beyond retrieval: generating language about images
CLIP creates a shared embedding space but is not generative — it cannot produce text describing an image, only compare an image to a set of candidate texts. The next class of models, Vision-Language Models (VLMs), combines a visual encoder with a language model decoder to produce free-form language conditioned on image input. This enables two tasks that are central to social media content analysis:
Image captioning: given an image, generate a natural-language description. A caption pipeline run at Instagram scale produces text metadata for every image, making vision-based content indexable by standard text analytics tools.
Visual Question Answering (VQA): given an image and a natural-language question, generate a natural-language answer. “Does this post contain a visible product brand logo?” “What is the primary emotion expressed by the person in this image?” “Is there text overlaid on this video frame, and if so, what does it say?” These questions, posed to a VLM, replace ten task-specific classifiers with a single model and a single inference call.
BLIP
BLIP (Bootstrapping Language-Image Pre-training; Li et al., 2022) from Salesforce Research was the first widely deployed open-source VLM. It combines three training objectives: image-text contrastive (like CLIP), image-text matching (binary classification of whether a pair is matched), and image-conditioned language generation. BLIP introduced CapFilt (Caption Filtering), a bootstrapping procedure that uses the model to generate synthetic captions for web images and then filters out noisy ones — substantially improving data quality without human annotation.
LLaVA
LLaVA (Large Language and Vision Assistant; Liu et al., 2023) takes a different approach: use a frozen CLIP image encoder to extract visual features, train a lightweight linear projection layer that maps image features into the token embedding space of a large language model (LLaMA, Vicuna), and instruction-fine-tune the combined system on visual instruction-following data. LLaVA is significant because it demonstrates that a relatively small amount of multimodal instruction tuning — 158,000 samples — is sufficient to produce a highly capable visual assistant when the language model backbone is already strong.
GPT-4V and GPT-4o
OpenAI’s GPT-4V (Vision) and GPT-4o (Omni) models are natively multimodal — the image and text are processed in a single unified forward pass rather than through a separate encoder-projector pipeline. GPT-4o additionally handles audio input and output natively. For social media analytics practitioners, the practical significance is that GPT-4o can receive a post’s image, caption, audio transcript, and hashtags simultaneously in a single API call and return a structured analysis — brand sentiment, detected products, dominant emotion, content safety flags — in a single response.
The code below demonstrates BLIP captioning and VQA using the transformers library. It requires torch and transformers and cannot run in the browser. A free Colab T4 GPU will run both examples in under 60 seconds after a one-time model download (~1.8 GB for BLIP-base).
from transformers import BlipProcessor, BlipForConditionalGeneration, BlipForQuestionAnswering
from PIL import Image
import requests, torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# ── BLIP image captioning ─────────────────────────────────────────
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
caption_model = BlipForConditionalGeneration.from_pretrained(
"Salesforce/blip-image-captioning-base"
).to(device)
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/320px-Cat03.jpg"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
# Unconditional captioning
inputs = processor(raw_image, return_tensors="pt").to(device)
out = caption_model.generate(**inputs, max_new_tokens=50)
print("Caption:", processor.decode(out[0], skip_special_tokens=True))
# Example output: "a cat sitting on a wooden floor"
# Conditional captioning: steer toward a domain
inputs_cond = processor(raw_image, "a photo of a", return_tensors="pt").to(device)
out_cond = caption_model.generate(**inputs_cond, max_new_tokens=50)
print("Conditional caption:", processor.decode(out_cond[0], skip_special_tokens=True))
# ── BLIP visual question answering ───────────────────────────────
vqa_model = BlipForQuestionAnswering.from_pretrained(
"Salesforce/blip-vqa-base"
).to(device)
questions = [
"Is there an animal in this image?",
"What color is the animal?",
"Is this image indoors or outdoors?",
]
for question in questions:
inputs = processor(raw_image, question, return_tensors="pt").to(device)
out = vqa_model.generate(**inputs, max_new_tokens=20)
answer = processor.decode(out[0], skip_special_tokens=True)
print(f"Q: {question}")
print(f"A: {answer}\n")GPT-4o multimodal API (single call for full post analysis)
from openai import OpenAI
import base64, json
from pathlib import Path
client = OpenAI() # uses OPENAI_API_KEY from environment
def encode_image(path: str) -> str:
return base64.b64encode(Path(path).read_bytes()).decode()
image_b64 = encode_image("instagram_post.jpg")
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "user",
"content": [
{"type": "image_url",
"image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
{"type": "text",
"text": (
"Analyse this social media post image. Return a JSON object with fields: "
"brand_logos (list of detected brand names), dominant_emotion (string), "
"content_safety_flags (list), estimated_age_group (string), "
"product_categories (list). Be concise."
)},
],
}],
    max_tokens=300,
    # JSON mode so the reply is guaranteed to parse with json.loads below
    response_format={"type": "json_object"},
)
result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))

TikTok content moderation at scale. TikTok’s Trust and Safety organisation processes more than one billion video uploads per month across 150+ markets. A text-only moderation classifier is insufficient: TikTok content frequently violates policy through visual imagery — dangerous stunts, graphic violence, unauthorised branded content — with captions that appear entirely benign. The moderation pipeline samples keyframes from each video, passes them through a vision-language model, generates structured metadata (scene description, detected objects, inferred activity), and merges this with NLP analysis of the caption, audio transcript, and hashtag text. The combined multimodal signal reduces false-negative rates on graphic content by approximately 60% relative to text-only baselines (industry estimate; exact figures are not publicly disclosed). VLMs like LLaVA enable this without maintaining ten separate task-specific classifiers — a single model answers the full suite of content policy questions.
Video as a Sequence of Frames
The scale problem
A video recorded at 30 frames per second and one minute in length contains 1,800 individual frames. Each frame is a full-resolution image — a \(1920 \times 1080 \times 3\) tensor for HD video, which is approximately 6.2 million numbers per frame, or 11 billion numbers per minute. Running a full CNN forward pass on every frame of every video at YouTube scale (500 hours of content uploaded per minute, as of 2024) is computationally infeasible with any architecture that processes frames independently.
The standard response is a suite of temporal compression strategies that reduce the number of frames requiring deep analysis, while preserving the semantic coverage of the content.
Keyframe extraction
Keyframe extraction selects a small set of frames that are most representative of the video content. The simplest approach is uniform sampling: extract one frame every \(k\) seconds. A 60-second video sampled at one frame per 2 seconds yields 30 frames — a 60× reduction in data volume before any neural network is involved. More sophisticated keyframe selection algorithms detect scene boundaries (a sharp change in pixel values across two adjacent frames indicates a cut or transition) and sample one frame per scene. For a typical 60-second social media video with 8–12 distinct scenes, this reduces the frame count to single digits.
Scene boundary detection can be implemented with a simple pixel-difference threshold:
\[\text{change}_{t} = \frac{1}{HW} \sum_{i,j} |x_{t,i,j} - x_{t-1,i,j}|\]
A frame \(t\) is a scene boundary if \(\text{change}_t > \theta\) for some threshold \(\theta\). In practice, downsampling each frame to \(32 \times 32\) before computing the difference makes this computation negligible in cost.
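A sketch of this boundary detector on a synthetic "video" of three flat-coloured scenes; the threshold \(\theta\) here is arbitrary and would be tuned per platform.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic "video": 60 downsampled 32x32 grayscale frames with two hard cuts
frames = np.zeros((60, 32, 32))
frames[:20]   = 40  + rng.normal(0, 2, (20, 32, 32))   # scene 1: dark
frames[20:45] = 180 + rng.normal(0, 2, (25, 32, 32))   # scene 2: bright
frames[45:]   = 90  + rng.normal(0, 2, (15, 32, 32))   # scene 3: mid-gray

# Mean absolute pixel difference between consecutive frames (the change_t formula)
change = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))   # length 59

theta = 30.0                                   # threshold; tune per platform
boundaries = np.where(change > theta)[0] + 1   # +1: diff index i compares frame i+1 to i
print("Scene boundaries at frames:", boundaries.tolist())     # expected: [20, 45]
```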
Optical flow
Optical flow computes, for each pixel, the 2D motion vector describing its displacement between two consecutive frames. The output is a flow field of shape \(H \times W \times 2\) — the same spatial dimensions as the image, but with two channels encoding horizontal and vertical velocity. Optical flow is used to distinguish stationary background regions from moving objects, to estimate camera motion (panning, zooming), and as an input feature for action recognition models. The classical Horn-Schunck and Lucas-Kanade algorithms are available in OpenCV; modern approaches use deep networks trained on synthetic optical flow datasets (RAFT, FlowNet).
3D CNNs and video transformers
For action recognition — classifying what activity is happening in a video clip — the standard approaches are:
3D CNNs (C3D, Tran et al., 2015; I3D, Carreira & Zisserman, 2017): extend the 2D convolution kernel to 3D by adding a temporal dimension. A \(3 \times 3 \times 3\) kernel slides over a clip of frames in both space and time simultaneously, detecting spatio-temporal patterns such as the motion of a hand or the trajectory of a ball.
Video Transformers (TimeSformer, VideoMAE): apply the attention mechanism from ViT to video, treating a short clip as a sequence of frame patches and computing attention across both spatial and temporal dimensions.
For most social media analytics applications, the full complexity of 3D CNNs and video transformers is unnecessary. The practical pipeline is: keyframe extraction → per-frame VLM analysis (captioning or classification) → aggregation of per-frame results by voting or embedding-mean. This is slower than a dedicated video model but is more flexible, requires no video-specific training data, and can leverage any advancing generation of image-language models without retraining.
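A sketch of the aggregation step only; the per-frame labels and embeddings below are random stand-ins for the outputs that, in production, would come from the BLIP/LLaVA calls shown earlier.

```python
import numpy as np
from collections import Counter

# Stand-ins for per-keyframe VLM outputs; each entry would normally come from
# a captioning or classification call on one extracted keyframe
frame_labels = ["cooking", "cooking", "product_showcase", "cooking", "cooking"]
frame_embs = np.random.default_rng(1).normal(size=(5, 512))
frame_embs /= np.linalg.norm(frame_embs, axis=1, keepdims=True)

# Aggregation strategy 1: majority vote over per-frame labels
video_label, count = Counter(frame_labels).most_common(1)[0]
print(f"Video-level label: {video_label} ({count}/{len(frame_labels)} frames)")

# Aggregation strategy 2: mean embedding, re-normalised, as the video's vector
video_emb = frame_embs.mean(axis=0)
video_emb /= np.linalg.norm(video_emb)
print("Video embedding shape:", video_emb.shape)   # (512,) — usable for retrieval
```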
Audio: From Waveform to Features
The digital audio signal
A digital audio recording is a one-dimensional time series: a sequence of amplitude samples recorded at a fixed sample rate (typically 44,100 Hz for music, 16,000 Hz for speech). At 16,000 Hz, one second of audio is 16,000 numbers; one minute is 960,000 numbers. Like images, audio is a tensor — but 1D rather than 2D.
Raw audio waveforms are not the most useful representation for machine learning models. The human auditory system does not process amplitude at individual moments; it perceives sound as a time-varying distribution of energy across frequency bands. The standard translation of this into a machine-usable feature is the spectrogram: a 2D representation of how the frequency content of the signal evolves over time.
Spectrograms via the Short-Time Fourier Transform
The Short-Time Fourier Transform (STFT) divides the audio signal into short overlapping windows (typically 25 ms wide with 10 ms hop), applies the Fast Fourier Transform (FFT) to each window, and stacks the resulting frequency-domain representations to form a 2D matrix whose axes are time and frequency. The magnitude spectrogram \(|S_{t,f}|\) at time frame \(t\) and frequency bin \(f\) measures the energy of frequency \(f\) during window \(t\).
In practice, the frequency axis is often converted to a mel scale — a perceptually uniform scale where equal distances correspond to equal perceptual pitch differences — producing a mel spectrogram. The mel spectrogram is then log-compressed (taking \(\log(1 + |S|)\)) to match the logarithmic sensitivity of human hearing. Mel-Frequency Cepstral Coefficients (MFCCs) are derived by applying the Discrete Cosine Transform to the rows of the log-mel spectrogram; the first 13–40 coefficients form a compact feature vector widely used in classical speech recognition.
The live cell below generates a synthetic audio signal — a 440 Hz sine wave (concert A) mixed with Gaussian noise — and computes its spectrogram using scipy.signal.spectrogram.
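A minimal sketch of such a cell: one second of audio with a steady 440 Hz tone in the first half and a linear sweep up to 880 Hz in the second, plus Gaussian noise, analysed with 25 ms windows and a 10 ms hop.

```python
import numpy as np
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

fs = 16_000                               # sample rate (Hz)
t = np.arange(0, 1.0, 1 / fs)             # one second of audio
half = len(t) // 2

# First half: steady 440 Hz tone; second half: sweep from 440 Hz up to 880 Hz
freq = np.full_like(t, 440.0)
freq[half:] = np.linspace(440.0, 880.0, len(t) - half)
phase = 2 * np.pi * np.cumsum(freq) / fs  # integrate frequency to obtain phase
signal = np.sin(phase) + 0.3 * np.random.default_rng(0).normal(size=len(t))

# nperseg=400 samples = 25 ms windows; noverlap=240 gives a 10 ms hop
f, times, Sxx = spectrogram(signal, fs=fs, nperseg=400, noverlap=240)
plt.pcolormesh(times, f, 10 * np.log10(Sxx + 1e-12), shading="auto")
plt.ylim(0, 2000)
plt.xlabel("Time (s)"); plt.ylabel("Frequency (Hz)")
plt.title("Spectrogram: 440 Hz tone, then sweep to 880 Hz")
plt.colorbar(label="Power (dB)")
plt.show()
```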
Interpretation. The spectrogram plot makes the frequency-time structure of the audio directly visible. The strong horizontal band starting at 440 Hz in the first half-second is the fundamental tone. The sweep to 880 Hz in the second half appears as a rising diagonal line in the spectrogram — a pattern that is invisible in the raw waveform but immediately clear in the 2D time-frequency representation. This illustrates why spectrograms are the standard input to audio classification models: they convert the audio problem into an image classification problem, making the full power of CNNs and ViTs available without any audio-specific architectural innovation.
Whisper for speech-to-text
For social media content containing speech — podcasts, TikTok voice-overs, YouTube commentary — converting audio to text enables all downstream NLP tools developed in Chapters 2 and 3. OpenAI’s Whisper (Radford et al., 2022) is the dominant open-source speech recognition model: trained on 680,000 hours of multilingual audio from the internet, it achieves near-human word error rates on standard English benchmarks and handles 99 languages without language-specific fine-tuning.
Whisper inference requires torch and the openai-whisper package (or the transformers implementation). The whisper-base model (~150 MB) runs in real time on a CPU; whisper-large-v3 requires a GPU for practical deployment.
import whisper
# Load the base model (multilingual, ~150 MB)
model = whisper.load_model("base")
# Transcribe a local audio file (mp3, mp4, wav, m4a all supported)
result = model.transcribe("tiktok_audio.mp4", language="en")
print("Transcript:")
print(result["text"])
# Segment-level output: timestamp + text per segment
print("\nSegments with timestamps:")
for seg in result["segments"]:
print(f" [{seg['start']:.1f}s – {seg['end']:.1f}s] {seg['text'].strip()}")
# For a Cantonese TikTok video:
result_yue = model.transcribe("cantonese_post.mp4", language="zh")
print("\nCantonese transcript (traditional characters):")
print(result_yue["text"])Whisper’s output is plain text with optional timestamps, which feeds directly into the sentiment analysis and topic modeling pipelines developed in Chapters 2 and 3. The combined audio-transcription-plus-NLP pipeline converts a social video into a structured analytics record in seconds per video at batch scale.
Detecting Brand Logos and Faces
Why brand logos matter
A brand logo appearing in a social media post is a commercial impression — equivalent in economic value to a paid placement, but often unaccounted for in standard advertising attribution models. For a sports brand, every frame of a livestreamed event that shows an athlete wearing their gear is an earned media impression; quantifying that exposure requires automated logo detection. For a brand safety team, a competitor’s logo appearing prominently in a post tagged with their own campaign hashtag is a content crisis in the making. For influencer marketing analytics firms — such as Zoomph, Veritone, or Nielsen Sports — logo exposure tracking is the core commercial product.
Similarly, face detection (identifying the spatial location of faces in an image, without identifying individuals) is a prerequisite for estimating audience demographics, detecting when brand ambassadors appear in posts, and ensuring compliance with policies around user-generated content featuring people.
Logo detection
Logo detection is treated as an object detection problem: given an image, produce bounding boxes around each instance of a target logo with an associated confidence score. The standard approach uses a YOLO (You Only Look Once) architecture or a Faster R-CNN with a brand-specific fine-tuned classification head. Training data typically comes from a mix of public benchmarks (LOGO-2K+, OpenLogo, Flickr Logos) and proprietary datasets compiled by the monitoring vendor.
The following code requires torch, torchvision, and Pillow. The YOLO example additionally requires ultralytics. Run in a Colab environment with a GPU runtime.
from ultralytics import YOLO
from PIL import Image, ImageDraw, ImageFont
import requests
from io import BytesIO
# Load a YOLOv8 model fine-tuned on brand logos
# In practice, brand monitoring vendors train proprietary models;
# this example uses the generic object detection model as a stand-in
model = YOLO("yolov8n.pt") # nano model, ~6 MB
# Load an image from a URL or local path
image_url = "https://example.com/sponsored_post.jpg"
img = Image.open(BytesIO(requests.get(image_url).content))
# Run inference
results = model(img, conf=0.35) # confidence threshold 0.35
print("Detected objects:")
for box in results[0].boxes:
cls_id = int(box.cls[0])
cls_name = model.names[cls_id]
conf = float(box.conf[0])
xyxy = box.xyxy[0].tolist() # [x1, y1, x2, y2] in pixels
print(f" {cls_name:<20} conf={conf:.2f} bbox={[round(v) for v in xyxy]}")
# Annotated image
annotated = results[0].plot() # returns a numpy array with bounding boxes drawn
Image.fromarray(annotated).save("annotated_post.jpg")
print("\nAnnotated image saved.")Face detection
Face detection — locating faces in an image — is distinct from face recognition — identifying whose face it is. Detection is legally and ethically benign in most jurisdictions; recognition of private individuals without consent is regulated or prohibited under GDPR Article 9, PDPO (Hong Kong), PIPL (China), and the Illinois BIPA. Analytics pipelines should detect faces to count people, estimate age range and emotion, or trigger downstream review, but should not perform identification without explicit legal basis.
The facenet-pytorch library provides a production-quality face detector (MTCNN) and recognition model (InceptionResnet), but only detection — without the recognition step — is appropriate for most social media analytics applications.
from facenet_pytorch import MTCNN
from PIL import Image
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# MTCNN: Multi-Task Cascaded CNN for face detection
mtcnn = MTCNN(
keep_all=True, # detect all faces, not just the largest
device=device,
min_face_size=20, # minimum face size in pixels to detect
thresholds=[0.6, 0.7, 0.7], # P-Net, R-Net, O-Net confidence thresholds
)
img = Image.open("group_photo.jpg").convert('RGB')
# Returns bounding boxes and detection probabilities
boxes, probs = mtcnn.detect(img)
if boxes is not None:
print(f"Detected {len(boxes)} face(s):")
for i, (box, prob) in enumerate(zip(boxes, probs)):
x1, y1, x2, y2 = [int(v) for v in box]
w, h = x2 - x1, y2 - y1
print(f" Face {i+1}: bounding box ({x1},{y1})–({x2},{y2}) "
f"size {w}×{h}px confidence={prob:.3f}")
else:
print("No faces detected.")
# ── Ethics checkpoint ────────────────────────────────────────────
# DO NOT pass detected faces through recognition models
# (InceptionResnet or equivalent) without explicit legal basis.
# Face detection → count, demographics estimate, emotion estimate: generally permissible.
# Face detection → identity lookup: regulated. Consult legal counsel.

YouTube copyright detection — Content ID. YouTube’s Content ID system processes every video upload against a reference database of copyrighted material. The system uses audio fingerprinting (matching characteristic spectral patterns from the reference track against the uploaded audio) and video fingerprinting (matching perceptual hashes of keyframes) to identify copyright matches at scale — roughly 500 hours of video are uploaded per minute, and Content ID checks each one in seconds. When a match is detected, the rights holder’s monetization policy is applied automatically: block, monetize (ads credited to the rights holder), or track (view counts attributed). Content ID is technically a multimodal matching system: audio and visual fingerprints are evaluated jointly, because a user who mutes the original audio track and replaces it with silence can evade audio-only fingerprinting, but the video hash still matches. The lesson for analytics practitioners: copyright enforcement, brand logo detection, and content safety all benefit from joint multimodal signals for the same reason — each individual modality is bypassable in isolation.
Building a Multimodal Content-Moderation Pipeline
Pipeline architecture
Stage 1: Ingestion and parallel feature extraction.
When a post arrives (text, image, and/or video), three parallel pipelines run simultaneously:
- Text pipeline: tokenize caption + hashtags; run through a fine-tuned BERT classifier producing per-policy-dimension scores (hate speech, spam, misinformation, dangerous content, graphic violence). Latency: ~50 ms.
- Image pipeline: extract keyframes (for video); run each frame through a ResNet-based classifier producing per-policy-dimension scores; run a CLIP image encoder to produce an embedding vector. Latency: ~100–300 ms per image.
- Audio pipeline (for video): run Whisper transcription; merge transcript with caption; re-run text pipeline on merged text. Latency: ~500 ms for a 60-second video.
Stage 2: Cross-modal embedding match.
Embed the post’s image (from the CLIP encoder in Stage 1) and the post’s text (from a text encoder) into the shared CLIP-compatible embedding space. Query a vector database of known-violating embeddings — posts that were previously removed with policy label — and retrieve the \(k\) nearest neighbours. If any neighbour has cosine similarity above a threshold \(\theta_\text{match}\) and shares the same policy category as the current post’s Stage 1 prediction, the match is used as a reinforcing signal.
\[\text{match\_signal}_{c} = \mathbf{1}\!\left[\max_{j \in \text{removed}} \cos(\mathbf{v}_\text{post}, \mathbf{v}_j) > \theta_\text{match} \text{ and } \text{label}_j = c\right]\]
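A sketch of the match-signal computation with random stand-in embeddings; a production system would query an approximate-nearest-neighbour index (e.g., FAISS) rather than perform the full matrix multiply shown here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins: embeddings of previously removed posts, each with a policy label
removed_embs = rng.normal(size=(1000, 512))
removed_embs /= np.linalg.norm(removed_embs, axis=1, keepdims=True)
removed_labels = rng.choice(["graphic_violence", "spam", "hate_speech"], size=1000)

post_emb = rng.normal(size=512)
post_emb /= np.linalg.norm(post_emb)
stage1_category = "spam"                 # the Stage 1 prediction for this post
theta_match = 0.85

# Cosine similarity against the removed-content index, then the indicator function
sims = removed_embs @ post_emb
same_label = removed_labels == stage1_category
match_signal = bool(np.any(sims[same_label] > theta_match))
print(f"match_signal[{stage1_category}] = {match_signal}, "
      f"best same-label similarity = {sims[same_label].max():.3f}")
```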
Stage 3: Score fusion and routing.
A lightweight score fusion layer (logistic regression or a small MLP) combines:
- Stage 1 text classifier score \(s^\text{text}_c\)
- Stage 1 image classifier score \(s^\text{image}_c\)
- Stage 2 match signal \(\text{match\_signal}_c\)
into a fused score \(\hat{p}_c\) for each policy category \(c\). Three routing outcomes:
| \(\hat{p}_c\) range | Action |
|---|---|
| \(\hat{p}_c > \theta_\text{auto\_remove}\) (e.g., 0.95) | Automated removal; post never shown |
| \(\theta_\text{review} < \hat{p}_c \leq \theta_\text{auto\_remove}\) | Human review queue; post shown with reduced distribution |
| \(\hat{p}_c \leq \theta_\text{review}\) | No action; normal distribution |
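A toy numeric sketch of Stage 3: the fusion weights, bias, and review threshold below are invented for illustration only, not calibrated values; in production they would be fit on labelled moderation outcomes.

```python
import numpy as np

def route(p_hat, theta_review=0.60, theta_auto_remove=0.95):
    if p_hat > theta_auto_remove:
        return "auto_remove"
    if p_hat > theta_review:
        return "human_review"
    return "no_action"

# Assumed fusion weights for illustration; a logistic regression or small MLP
# would learn these from human-review outcomes
w = np.array([2.5, 2.2, 1.8])        # [text score, image score, match signal]
b = -2.5

def fused_score(s_text, s_image, match_signal):
    z = w @ np.array([s_text, s_image, float(match_signal)]) + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid

for s_text, s_image, match in [(0.10, 0.15, False),   # benign post
                               (0.65, 0.80, False),   # ambiguous
                               (0.95, 0.92, True)]:   # strongly violating
    p = fused_score(s_text, s_image, match)
    print(f"text={s_text:.2f} image={s_image:.2f} match={match!s:<5} "
          f"-> p_hat={p:.3f} -> {route(p)}")
```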
Stage 4: Human review and feedback loop.
Posts routed to human review are evaluated by moderators in the appropriate language and cultural context. The outcome — confirm removal, restore, escalate — is logged and fed back into the training pipeline for all three Stage 1 models. This creates a self-improving system: the models are continuously fine-tuned on the most recent policy-violating content, keeping pace with novel formats and adversarial evasion.
Failure modes and safeguards
Three failure modes are common and must be explicitly designed against:
False positives on protected speech. A content safety model trained predominantly on English data may apply policy labels more aggressively to text in minority languages or dialects — not because the content is more violating, but because the model is less calibrated. Meta’s Integrity team has published evidence of this bias and employs per-language calibration layers. For a practitioner building a moderation system, this means: evaluate false-positive rates stratified by language, region, and demographic group, not just in aggregate.
Adversarial evasion. Bad actors adapt. Inserting a single high-contrast pixel (an “adversarial perturbation”) into an image can flip a CNN classifier’s output with high confidence while the image looks unchanged to a human. The standard defense is adversarial training — including adversarially perturbed examples in the training set — and ensemble models that are harder to fool simultaneously.
Threshold drift. A fixed threshold \(\theta_\text{auto\_remove} = 0.95\) may be well-calibrated at launch but drift as the model is fine-tuned on new data. Implement automated calibration monitoring: compute the empirical precision and recall each week on a random sample of the human-review queue, and re-tune the thresholds against current performance targets.
Pinterest visual search and brand safety. Pinterest’s business model depends on a virtuous cycle: users discover products through visual search, click through to merchant pages, and purchase. Visual search — “find more items like this pin” — is powered by a Pinterest-internal visual embedding model (PinSage, trained on Pinterest’s own graph of boards and pins) that is distinct from but conceptually similar to CLIP. For brand safety, Pinterest must ensure that searches for “luxury watch” do not surface counterfeit goods and that children’s content searches do not surface adult-adjacent imagery. The moderation challenge is harder than text-based platforms because the same image can be appropriate or inappropriate depending on context: a photo of a swimwear model is appropriate in a fashion board but inappropriate in a children’s fashion board. Pinterest’s solution is contextual moderation: the image embedding is evaluated not in isolation but together with the embeddings of the board it belongs to and the user’s historical engagement, making the effective policy boundary context-dependent. This is the frontier of multimodal moderation: moving from per-post to per-context policy enforcement.
Closing: The End-to-End Modern Analytics Stack
The arc of this book has been a deliberate progression through the layers of the modern social media analytics stack.
Chapter 1 established the text foundation: how raw social media text — tweets, captions, comments, hashtags — is cleaned, normalized, and transformed into numerical representations via bag-of-words, TF-IDF, and early word embeddings. The central lesson was that turning words into numbers is a non-trivial choice, and the choice determines what patterns can be discovered downstream.
Chapter 2 moved to sentiment analysis: from lexicon-based rule systems through classical machine learning to fine-tuned transformer models. The central lesson was that sentiment is context-dependent, domain-specific, and genuinely difficult — a model that achieves 92% accuracy on movie reviews achieves 71% on financial earnings calls without domain adaptation.
Chapter 3 covered Large Language Models: the transformer architecture, pretraining and instruction fine-tuning, prompting strategies, and the practical economics of deploying LLMs on social text at scale. The central lesson was that LLMs have changed the cost structure of NLP work — labelled training data is no longer the bottleneck it once was — but they introduce new failure modes (hallucination, calibration, PII exposure) that demand systematic evaluation.
Chapter 4 (virality and diffusion) examined how content spreads through social networks — the Bass diffusion model, the viral coefficient, seeding strategies, and the analytics of amplification. The central lesson was that content quality and network structure interact multiplicatively: the same post can go nowhere or reach millions depending on where in the social graph it first lands.
This chapter, Chapter 5, added the third dimension: images, video, and audio. Modern social media is inherently multimodal, and the analytics infrastructure must match. Joint vision-language embeddings (CLIP), vision-language models (BLIP, LLaVA, GPT-4V), audio spectrograms and Whisper transcription, and multimodal content moderation pipelines are not specialist topics at the frontier of research — they are production infrastructure running at planet scale inside Meta, TikTok, YouTube, and Pinterest today.
Chapter 6 (forthcoming) will examine misinformation and information operations: how coordinated campaigns are detected, how synthetic media (deepfakes) are identified, and how graph-based methods — combining the network analytics of the earlier chapters with the multimodal signals of this chapter — reveal coordinated inauthentic behavior. The multimodal pipeline built in Chapter 5 is the prerequisite: detecting a deepfake requires reasoning jointly about visual authenticity, audio-visual synchronization, and textual narrative patterns, none of which is sufficient alone.
The unifying theme across all six chapters is the same: social media analytics is an engineering discipline that requires simultaneous competence across text, network, and multimodal domains. The practitioner who understands tokenization but not convolution, or who can run a sentiment model but not interpret a spectrogram, or who can detect brand logos but cannot design a moderation pipeline, will find themselves unable to work with the systems that now govern the information environments of billions of people. This book is an attempt to build that integrated competence — one layer at a time, from first principles to production.
Prof. Xuhu Wan · HKUST · Modern AI Stack for Social Data · 2026 Edition