Chapter 10: Recommenders and Personalized Ranking
About This Chapter
Every chapter in this book has been building toward something. The embeddings chapter gave us a way to represent text and users as points in a geometric space. The foundation models chapter gave us a way to encode rich semantic meaning from raw content. The vector database chapter gave us the machinery to search millions of those points in milliseconds. The graph analytics chapter gave us a way to understand how people relate to one another and to the content they consume. All of those tools are genuinely useful in isolation. But the application that ties them together — the one that actually decides what a billion people see when they open their phones in the morning — is the recommender system.
The recommender is the most consequential piece of software on the modern social-media stack. YouTube’s recommendation engine drives over 70% of total watch time; if you have ever opened YouTube intending to watch one video and emerged two hours later having watched twelve, you have experienced the system working exactly as designed. TikTok’s For You Page is not a feature of the product — it is the product. The entire interaction model, the infinite scroll, the full-screen autoplay, the absence of a search bar as the primary entry point, is built around the assumption that the algorithm knows what you want better than you do. Spotify’s Discover Weekly, launched in 2015, became one of the most-loved product features in the company’s history not because Spotify had better music than its competitors, but because it had a better model of each listener’s latent taste. In all three cases, the recommender is not merely serving users; it is shaping their media diet, their attention, and, through both, their beliefs and preferences over time.
This is also why the recommender is technically the most demanding problem in applied machine learning. It operates at a scale that breaks most conventional engineering assumptions: hundreds of millions of users, hundreds of millions of items, billions of interactions per day, and a requirement to serve a personalized ranked list to each user within twenty milliseconds of a request. It must be right — a bad recommendation costs engagement. It must be fast — a slow recommendation is invisible, because the user has already scrolled past. It must be fresh — yesterday’s viral video is today’s stale content. And it must optimize for objectives that are not fully specified, not fully measurable, and sometimes in direct conflict with each other.
This chapter covers the full arc from mathematical foundations to production realities. We begin with the formal statement of the recommendation problem and trace the intellectual history from collaborative filtering through matrix factorization through deep learning through the transformer-based models that define the current state of the art. We develop the mathematics of Bayesian Personalized Ranking, two-tower retrieval, and multi-objective optimization. We examine the multi-stage retrieval-ranking-reranking pipeline that every large platform uses in production. We confront the hard problems: cold start, counterfactual evaluation, filter bubbles, and the long-run welfare of the people the system is supposed to be serving. And we close with a fully worked architectural design for a short-video recommender at TikTok scale, walking through every design decision from the embedding tables to the online learning loop.
Key papers throughout: Rendle et al. (2009), “BPR: Bayesian Personalized Ranking from Implicit Feedback”; Cheng et al. (2016), “Wide & Deep Learning for Recommender Systems” (Google); Naumov et al. (2019), “Deep Learning Recommendation Model for Personalization and Recommendation Systems” (Meta); Yi et al. (2019), “Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations” (Google); Kang & McAuley (2018), “Self-Attentive Sequential Recommendation” (SASRec); Sun et al. (2019), “BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer.” For ethical dimensions: Bakshy et al. (2015), “Exposure to ideologically diverse news and opinion on Facebook” (Science); Allcott et al. (2020), “The welfare effects of social media” (AER). Production-grade code requiring PyTorch, TorchRec, or Merlin is marked throughout and will not execute in the browser.
Table of Contents
- The Recommendation Problem
- Collaborative Filtering — Classical Roots
- Matrix Factorization and the Netflix Prize
- BPR: Bayesian Personalized Ranking
- Deep Learning for Recommendation
- Two-Tower Retrieval
- The Multi-Stage Recommender Pipeline
- Sequential and Session-Based Recommendation
- Multi-Objective Optimization
- Counterfactual Evaluation and Off-Policy Methods
- The Cold-Start Problem
- Ethical and Long-Run Considerations
- Production Realities — Latency, Throughput, Freshness
- Mini Case Study — Designing a Short-Video Recommender
- Closing — Recommenders as the Defining Application of AI
The Recommendation Problem
Formal statement
The recommendation problem can be stated with deceptive simplicity. Given a user \(u \in \mathcal{U}\), a set of candidate items \(\mathcal{I} = \{i_1, i_2, \ldots, i_m\}\), and a context \(c\) (device, time of day, session history, location), predict a score
\[\hat{y}_{u,i,c} = f_\theta(u, i, c)\]
for each item \(i\), and return the top-\(K\) items ranked by \(\hat{y}_{u,i,c}\). The function \(f_\theta\) is parameterized by a model with parameters \(\theta\), and the entire engineering challenge is in specifying what \(\hat{y}_{u,i,c}\) means, how to train \(\theta\) efficiently at scale, and how to evaluate the result in a way that tracks the thing you actually care about.
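Concretely, the whole problem reduces to “evaluate \(f_\theta\) over the candidates and sort.” A minimal numpy sketch of that reduction, with a toy random affinity table standing in for \(f_\theta\) (the table, the `top_k` helper, and all names here are illustrative, not any platform’s API):

```python
import numpy as np

def top_k(score_fn, user, items, context, k=10):
    """Score every candidate item for one (user, context) pair and
    return the k highest-scoring items with their scores."""
    scores = np.array([score_fn(user, i, context) for i in items])
    order = np.argsort(-scores)[:k]          # indices of the top-k scores
    return [(items[i], scores[i]) for i in order]

# Toy f_theta: a fixed random affinity table, purely for illustration
rng = np.random.default_rng(0)
affinity = rng.random((5, 20))               # 5 users x 20 items
f = lambda u, i, c: affinity[u, i]
print(top_k(f, user=2, items=list(range(20)), context=None, k=3))
```

Everything that follows in this chapter is about making `score_fn` good and making the scan over `items` fast enough to be unnecessary.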
The phrase “the thing you actually care about” is doing enormous hidden work. Early recommender systems, built in the 1990s for products like GroupLens and Amazon, were rating predictors: given that a user has assigned explicit star ratings to a subset of items, predict the rating they would assign to an unrated item, and recommend items predicted to receive high ratings. This formulation is mathematically clean and well-studied, but it is not what any major platform actually optimizes.
The evolution of objectives
Rating prediction was replaced, over roughly a decade, by engagement prediction: instead of predicting a five-star rating that users rarely bother to assign, predict whether a user will click, watch, like, share, or comment. Engagement signals are cheap and abundant — every session generates thousands of implicit observations — and they are directly connected to the business metric (a clicked ad generates revenue; a watched video generates advertising inventory). The shift from rating prediction to engagement prediction coincided with the explosion of social media in the 2005–2015 period and produced systems of stunning effectiveness at maximizing the objective they were given.
The problem is that engagement is not satisfaction. A user who watches a clickbait video to the end has generated a positive engagement signal. A user who watches increasingly extreme political content because each video is marginally more engaging than the last is generating strong engagement signals while accumulating genuine dissatisfaction — which only becomes legible to the system weeks later, when the user reduces their usage frequency or leaves the platform entirely. YouTube publicly acknowledged in 2019 that their recommender had driven significant traffic to extremist content precisely because that content was engaging. The mismatch between the short-term engagement signal and the long-term welfare of the user is not an accident of implementation; it is a structural feature of any system that optimizes a proxy for satisfaction rather than satisfaction itself.
The current frontier — visible in published research from YouTube, Spotify, and Pinterest — is the shift toward long-term satisfaction maximization: building systems that explicitly model user welfare over multi-week horizons, penalize patterns associated with regret (low post-consumption ratings, reduced return visit rates), and introduce deliberate diversity to prevent the echo-chamber dynamics that emerge from pure engagement optimization. This is technically harder than engagement prediction, because the labels are noisier, the reward signal is delayed, and the counterfactual (what would the user’s satisfaction have been under a different recommendation policy?) is never observed directly.
The evaluation challenge
The recommender is also unusually difficult to evaluate offline. In a classification task, you hold out a test set and measure accuracy on it. In a recommender, the test data — the interactions you observe — was generated by whatever policy was running at the time. If the system recommended action movies to user A, you can observe whether user A clicked on the action movies. You cannot observe whether user A would have clicked on a documentary, because the system never showed them one. The data is missing not at random; it is missing precisely where the current policy thought the recommendations were bad. Any offline evaluation metric computed on logged data is biased by this selection effect, and the bias can be large. This is the central challenge of counterfactual evaluation, which we take up in detail in Section 10.
Collaborative Filtering — Classical Roots
The intuition
Collaborative filtering (CF) rests on a single, powerful observation: users who have agreed in the past will tend to agree in the future. If you and I have both rated the same fifty movies and our ratings correlate strongly, then when you rate a film I have not seen, that rating is informative about what I would think. This is collaborative — we are, implicitly, using the entire community of users as mutual recommenders — and it is filtering — we are filtering the item space down to what the community collectively predicts will be relevant to each individual.
The observation was first operationalized at scale by Konstan et al. (1997) in the GroupLens system for Usenet news recommendations. GroupLens used user-based CF: for a target user \(u\), find the \(K\) most similar users (by Pearson correlation of ratings), and predict \(u\)’s rating for an unseen item as a weighted average of those neighbors’ ratings:
\[\hat{r}_{u,i} = \bar{r}_u + \frac{\sum_{v \in \mathcal{N}(u)} \text{sim}(u, v) \cdot (r_{v,i} - \bar{r}_v)}{\sum_{v \in \mathcal{N}(u)} |\text{sim}(u, v)|}\]
where \(\bar{r}_u\) is user \(u\)’s mean rating, \(r_{v,i}\) is the rating assigned by neighbor \(v\) to item \(i\), and \(\text{sim}(u, v)\) is the Pearson correlation between users \(u\) and \(v\) computed over the items both have rated.
Item-based CF, popularized by Amazon’s product recommender (Linden, Smith, and York, 2003), flips the analogy: instead of finding similar users, find items similar to those the user has already engaged with, and recommend those. Item-based CF has a practical advantage at Amazon’s scale: item-item similarities change slowly (the items in the catalog are relatively stable over time), while user-user similarities must be recomputed frequently as new users arrive. Precomputing item-item similarities offline and looking them up at query time is much faster than computing user-user similarities on the fly.
The sparsity problem
Both user-based and item-based CF suffer from the sparsity problem: the rating matrix \(R \in \mathbb{R}^{|\mathcal{U}| \times |\mathcal{I}|}\) is almost entirely empty. Netflix has over 200 million subscribers and over 15,000 titles, but the average subscriber rates perhaps 200 titles — an observation density of just over 1%, leaving nearly 99% of the matrix empty. At that sparsity, the Pearson correlation between most pairs of users is computed from very few co-rated items, making similarity estimates noisy and neighbor sets unreliable.
The sparsity problem motivates the shift from neighborhood-based methods to latent factor methods, which instead of computing similarities directly learn a dense low-dimensional representation of both users and items such that users and items predicted to interact are close in the latent space.
Live demo: user-based CF on a toy rating matrix
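The demo runs code along these lines — a numpy sketch with an invented 7-user, 6-item rating matrix (NaN marks an unrated slot), laid out so that users 0–1 are sci-fi fans, users 2–3 romance fans, and users 5–6 action fans, as the interpretation below assumes. Columns are Sci-Fi A/B, Romance A/B, Action A/B:

```python
import numpy as np

# Toy ratings (nan = unrated); the data is invented for illustration.
R = np.array([
    [5, 5, np.nan, 1, 2, np.nan],   # user 0: sci-fi fan
    [5, 4, 1, 1, np.nan, 2],        # user 1: sci-fi fan
    [1, np.nan, 5, 5, np.nan, 1],   # user 2: romance fan
    [np.nan, 1, 4, 5, 2, 1],        # user 3: romance fan
    [3, 3, 3, np.nan, 3, 3],        # user 4: mixed taste
    [1, 2, np.nan, 1, 5, 5],        # user 5: action fan
    [2, 1, 1, np.nan, 5, 4],        # user 6: action fan
])

def pearson(a, b):
    m = ~np.isnan(a) & ~np.isnan(b)          # items both users rated
    if m.sum() < 2:
        return 0.0                            # too few co-raters
    x, y = a[m] - a[m].mean(), b[m] - b[m].mean()
    d = np.sqrt((x * x).sum() * (y * y).sum())
    return float((x * y).sum() / d) if d > 0 else 0.0

n = R.shape[0]
S = np.array([[pearson(R[u], R[v]) for v in range(n)] for u in range(n)])

def predict(u, i, K=2):
    # Mean-centered weighted average over the K most similar raters of i
    nbrs = [v for v in np.argsort(-S[u])
            if v != u and not np.isnan(R[v, i])][:K]
    if not nbrs:
        return np.nan                         # sparsity: no usable neighbor
    num = sum(S[u, v] * (R[v, i] - np.nanmean(R[v])) for v in nbrs)
    den = sum(abs(S[u, v]) for v in nbrs)
    return np.nanmean(R[u]) + num / den if den > 0 else np.nanmean(R[u])

print("sim(user0, user1) =", round(S[0, 1], 2))
print("predicted r(user0, Romance A) =", round(predict(0, 2), 2))
```

The prediction for user 0 on Romance A comes out low, driven by the high similarity to user 1, who rated it poorly — the mechanism the interpretation below walks through.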
Interpretation. The similarity matrix (left) reveals the latent structure the model has recovered without being told about genre: users 0 and 1 are highly similar (both sci-fi fans), users 2 and 3 are highly similar (romance fans), and users 5 and 6 are highly similar (action fans). The rating matrix (right) shows observed ratings in bold blue; italic orange values are CF predictions for unrated slots. Notice that user 0 receives a prediction of roughly 1 for Romance A — correctly inferring that a confirmed sci-fi fan would dislike romance content — based on the high similarity to user 1 who also rated it low. The sparsity problem is visible: several slots receive no prediction because there are too few co-raters to compute a meaningful similarity.
Matrix Factorization and the Netflix Prize
From neighborhoods to latent factors
The intellectual breakthrough of the late 2000s was the recognition that the recommendation problem could be reformulated as matrix completion: given a sparse matrix \(R\) of observed ratings, find a dense matrix \(\hat{R}\) that agrees with \(R\) on the observed entries and generalizes sensibly to the unobserved ones. If both users and items can be represented by vectors in a low-dimensional latent space \(\mathbb{R}^k\), then the predicted rating is simply the dot product of their latent representations:
\[\hat{r}_{u,i} = \mathbf{p}_u^\top \mathbf{q}_i\]
where \(\mathbf{p}_u \in \mathbb{R}^k\) is the user’s latent factor vector and \(\mathbf{q}_i \in \mathbb{R}^k\) is the item’s latent factor vector. The integer \(k\) is a hyperparameter, typically chosen in the range 10–200; it controls the expressiveness of the model and the risk of overfitting.
With bias terms, the prediction becomes:
\[\hat{r}_{u,i} = \mu + b_u + b_i + \mathbf{p}_u^\top \mathbf{q}_i\]
where \(\mu\) is the global mean rating, \(b_u\) is a user-specific bias (some users rate generously, others harshly), and \(b_i\) is an item-specific bias (some items are universally loved or universally disliked regardless of taste match). Adding biases substantially improves predictive accuracy with negligible additional complexity.
In 2006, Netflix launched a public competition offering $1,000,000 to any team that could improve the accuracy of their recommender by 10% on a held-out test set, measured by root mean squared error (RMSE) on star ratings. The prize ran for three years, attracted over 5,000 teams from 186 countries, and produced an enormous body of research on recommendation algorithms. The winning submission (the “BellKor’s Pragmatic Chaos” team, 2009) achieved a 10.06% improvement over Netflix’s baseline Cinematch system — by 0.06 percentage points, after three years of the best machine learning researchers in the world competing. The prize established matrix factorization as the dominant paradigm for recommendation, popularized the alternating least squares (ALS) and stochastic gradient descent (SGD) training algorithms, and demonstrated the value of ensemble methods (the winning solution blended over 100 component models). Somewhat ironically, by the time the prize was awarded, Netflix had shifted its primary business from DVD-by-mail to streaming, and the nature of the recommendation problem had changed: instead of predicting explicit star ratings (which streaming users rarely assign), the system needed to predict play decisions from implicit behavioral signals. The Netflix Prize solution was never deployed in production.
Training by stochastic gradient descent
We train the model by minimizing the regularized squared error on observed ratings:
\[\mathcal{L} = \sum_{(u,i) \in \mathcal{O}} \left(r_{u,i} - \hat{r}_{u,i}\right)^2 + \lambda \left(\|\mathbf{p}_u\|^2 + \|\mathbf{q}_i\|^2 + b_u^2 + b_i^2\right)\]
where \(\mathcal{O}\) is the set of observed (user, item) rating pairs and \(\lambda > 0\) is the regularization coefficient. The SGD update rules for a single observed pair \((u, i)\) with error \(e_{ui} = r_{ui} - \hat{r}_{ui}\) are:
\[\mathbf{p}_u \leftarrow \mathbf{p}_u + \eta \left(e_{ui} \mathbf{q}_i - \lambda \mathbf{p}_u\right)\] \[\mathbf{q}_i \leftarrow \mathbf{q}_i + \eta \left(e_{ui} \mathbf{p}_u - \lambda \mathbf{q}_i\right)\] \[b_u \leftarrow b_u + \eta \left(e_{ui} - \lambda b_u\right)\] \[b_i \leftarrow b_i + \eta \left(e_{ui} - \lambda b_i\right)\]
where \(\eta\) is the learning rate. Each gradient step costs \(O(k)\) — linear in the factor dimension and independent of the number of users and items — making SGD the preferred optimizer for large-scale matrix factorization. The alternative, Alternating Least Squares (ALS), fixes one factor matrix and solves exactly for the other, alternating until convergence; ALS is preferred in distributed settings (it is the method used by Spark MLlib) because it parallelizes more cleanly.
Live demo: matrix factorization with SGD
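A numpy sketch of the kind of cell the demo runs, implementing the biased-MF SGD updates above on an invented toy matrix (users 0–1 sci-fi fans, 2–3 romance fans, 5–6 action fans, matching the interpretation below; all hyperparameters are illustrative):

```python
import numpy as np

R = np.array([
    [5, 5, np.nan, 1, 2, np.nan],   # user 0: sci-fi
    [5, 4, 1, 1, np.nan, 2],        # user 1: sci-fi
    [1, np.nan, 5, 5, np.nan, 1],   # user 2: romance
    [np.nan, 1, 4, 5, 2, 1],        # user 3: romance
    [3, 3, 3, np.nan, 3, 3],        # user 4: mixed
    [1, 2, np.nan, 1, 5, 5],        # user 5: action
    [2, 1, 1, np.nan, 5, 4],        # user 6: action
])
n_u, n_i, k = R.shape[0], R.shape[1], 4
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((n_u, k))   # user factors
Q = 0.1 * rng.standard_normal((n_i, k))   # item factors
bu, bi = np.zeros(n_u), np.zeros(n_i)
mu = np.nanmean(R)                        # global mean
obs = [(u, i) for u in range(n_u) for i in range(n_i)
       if not np.isnan(R[u, i])]

eta, lam = 0.02, 0.01
for epoch in range(2000):
    for u, i in obs:
        e = R[u, i] - (mu + bu[u] + bi[i] + P[u] @ Q[i])
        pu = P[u].copy()                  # use pre-update value for Q's step
        P[u] += eta * (e * Q[i] - lam * P[u])
        Q[i] += eta * (e * pu - lam * Q[i])
        bu[u] += eta * (e - lam * bu[u])
        bi[i] += eta * (e - lam * bi[i])

rmse = np.sqrt(np.mean([(R[u, i] - (mu + bu[u] + bi[i] + P[u] @ Q[i])) ** 2
                        for u, i in obs]))
print(f"final RMSE on observed entries: {rmse:.3f}")
```

Plotting the rows of `P` and `Q` (e.g., their first two dimensions) reproduces the latent-factor scatter the interpretation describes: users and items of the same genre land near each other.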
Interpretation. The RMSE curve confirms convergence within roughly 500 iterations; the final RMSE on observed entries is below 0.1, indicating that the four-dimensional latent space is sufficient to capture the structure in this small matrix. The latent factor scatter plot is the most informative: users 0 and 1 (sci-fi fans) cluster together in the upper-left; users 2 and 3 (romance fans) cluster in a different quadrant; users 5 and 6 (action fans) cluster together. Items cluster by genre in the same space: the Sci-Fi items are near the sci-fi users, Romance items near the romance users. The model has discovered genre as a latent factor purely from the co-occurrence of ratings — it was never told what a genre is. This is the core power of latent factor models: they infer interpretable structure from behavioral signals.
BPR: Bayesian Personalized Ranking
The implicit feedback problem
Explicit star ratings are rare. Most behavioral data is implicit: a click, a play, a purchase, a scroll-past. These signals are abundant but ambiguous. A play means something positive, but its absence does not mean the user disliked the item — it may simply mean the user never saw it. The standard matrix factorization objective (minimize RMSE on observed ratings) is inappropriate for implicit data: it treats unobserved as “rating zero,” which conflates “disliked” with “not yet exposed to.”
Rendle, Freudenthaler, Gantner, and Schmidt-Thieme (2009) introduced Bayesian Personalized Ranking (BPR) to handle implicit feedback correctly. The key insight is that the right training signal is not the absolute value of engagement, but the relative ordering: if user \(u\) clicked on item \(i^+\) and did not click on item \(i^-\), then we should learn a model that ranks \(i^+\) above \(i^-\) for user \(u\). We do not need to know why the user did not click on \(i^-\) — we only need to encode the pairwise preference.
The BPR objective
BPR assumes that the probability of a user \(u\) preferring item \(i^+\) over item \(i^-\) follows a sigmoid:
\[p(i^+ \succ_u i^-) = \sigma\!\left(\hat{y}_{u,i^+} - \hat{y}_{u,i^-}\right) = \frac{1}{1 + \exp\!\left(-(\hat{y}_{u,i^+} - \hat{y}_{u,i^-})\right)}\]
where \(\hat{y}_{u,i} = \mathbf{p}_u^\top \mathbf{q}_i\) is the MF score for the user-item pair. BPR maximizes the log-likelihood of the observed pairwise preferences over a training set \(\mathcal{D}_S = \{(u, i^+, i^-) : u \text{ clicked } i^+, u \text{ did not click } i^-\}\), with L2 regularization:
\[\mathcal{L}_{\text{BPR}} = -\sum_{(u, i^+, i^-) \in \mathcal{D}_S} \ln \sigma\!\left(\hat{y}_{u,i^+} - \hat{y}_{u,i^-}\right) + \lambda \|\Theta\|^2\]
where \(\Theta = \{\mathbf{P}, \mathbf{Q}\}\) collects all model parameters. The gradient with respect to a single training triple \((u, i^+, i^-)\) is:
\[\frac{\partial \mathcal{L}_{\text{BPR}}}{\partial \hat{y}_{u,i^+}} = -\left(1 - \sigma(\hat{y}_{u,i^+} - \hat{y}_{u,i^-})\right)\]
Negative items \(i^-\) are sampled uniformly from the unobserved items; the sampling strategy matters in practice — popular items drawn as negatives produce a harder training signal and often better-calibrated rankings.
BPR is directly optimizing what we actually want from a recommender: correct relative ordering. A model trained with BPR learns that user \(u\)’s clicked item should rank above their unclicked items, which is precisely what the final NDCG or precision-at-K metric will measure.
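A minimal numpy implementation of BPR-MF under these definitions — SGD over sampled \((u, i^+, i^-)\) triples with uniform negative sampling. The synthetic click pattern and all hyperparameters are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_bpr(clicks, n_users, n_items, k=8, eta=0.05, lam=0.01,
              steps=20000, seed=0):
    """clicks: set of (user, item) implicit positives."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))
    Q = 0.1 * rng.standard_normal((n_items, k))
    pos = {u: [i for (uu, i) in clicks if uu == u] for u in range(n_users)}
    users = [u for u in pos if pos[u]]
    for _ in range(steps):
        u = users[rng.integers(len(users))]
        ip = pos[u][rng.integers(len(pos[u]))]
        im = int(rng.integers(n_items))        # uniform negative sample
        while (u, im) in clicks:
            im = int(rng.integers(n_items))
        x = P[u] @ (Q[ip] - Q[im])             # margin y(u,i+) - y(u,i-)
        g = 1.0 - sigmoid(x)                   # from -d/dx ln sigma(x)
        pu = P[u].copy()
        P[u]  += eta * (g * (Q[ip] - Q[im]) - lam * P[u])
        Q[ip] += eta * (g * pu - lam * Q[ip])
        Q[im] += eta * (-g * pu - lam * Q[im])
    return P, Q

# Synthetic implicit feedback: even users click even items, odd users odd
clicks = {(u, i) for u in range(20) for i in range(30) if u % 2 == i % 2}
P, Q = train_bpr(clicks, n_users=20, n_items=30)
```

After training, `P[u] @ Q.T` should rank a user’s clicked items above their unclicked ones — the pairwise ordering BPR optimizes directly.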
NDCG: the evaluation metric
The standard offline metric for ranked lists is Normalized Discounted Cumulative Gain (NDCG). For a ranked list of \(K\) items, the DCG is:
\[\text{DCG}@K = \sum_{k=1}^{K} \frac{2^{\text{rel}(k)} - 1}{\log_2(k+1)}\]
where \(\text{rel}(k)\) is the relevance of the item at rank \(k\) (binary: 1 if clicked, 0 if not). NDCG normalizes DCG by the ideal DCG (achieved when all relevant items are ranked at the top):
\[\text{NDCG}@K = \frac{\text{DCG}@K}{\text{IDCG}@K}\]
NDCG penalizes relevant items that appear lower in the ranking, with a logarithmic discount: a relevant item placed at rank 10 contributes much less to the score than one placed at rank 2, and a relevant item outside the top \(K\) contributes nothing. In practice, platforms typically evaluate at \(K \in \{5, 10, 20\}\) and report the average across all users in the test set.
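A direct numpy translation of the formulas above, for binary relevance on a single ranked list (the example lists are invented):

```python
import numpy as np

def ndcg_at_k(relevance, k):
    """relevance: binary labels in ranked order (1 = clicked, 0 = not)."""
    rel = np.asarray(relevance, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # 1/log2(k+1)
    dcg = float(((2 ** rel - 1) * discounts).sum())
    # Ideal DCG: all relevant items ranked first
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:k]
    idcg = float(((2 ** ideal - 1) * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

# Clicks at ranks 1 and 4 out of 5 shown items
print(ndcg_at_k([1, 0, 0, 1, 0], k=5))   # ≈ 0.877
```

A perfect ranking (all clicked items on top) gives NDCG = 1.0; a list with no relevant items is conventionally scored 0.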
Deep Learning for Recommendation
Why deep learning?
Matrix factorization captures the interaction between a user and an item as a simple dot product of latent vectors. This is a strong restriction: it assumes that user preferences and item attributes combine linearly in the latent space, and it handles only ID features (user ID and item ID). Modern platforms have thousands of additional features: user demographics, session context, item content signals, cross-item interaction history, real-time behavioral signals. Fitting these features into a pure dot-product model requires extensive feature engineering. Deep learning for recommendation lifts this restriction by learning arbitrary nonlinear combinations of all available features.
Wide & Deep (Google, 2016)
Cheng et al. (2016) introduced Wide & Deep for app recommendation in Google Play. The architecture has two components trained jointly:
The wide component is a linear model on hand-engineered cross-product features (e.g., user installed App A AND App B → recommend App C). It memorizes frequent co-occurrence patterns explicitly. The deep component is a feedforward DNN on continuous and sparse features (categorical features embedded into dense vectors, concatenated, passed through several fully connected layers with ReLU activations). The two components are combined in the output layer:
\[\hat{y} = \sigma\!\left(\mathbf{w}_{\text{wide}}^\top [\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{\text{deep}}^\top \mathbf{a}^{(L)} + b\right)\]
where \(\phi(\mathbf{x})\) are the cross-product features, \(\mathbf{a}^{(L)}\) is the final layer activation of the DNN, and \(\sigma\) is the sigmoid (for binary click prediction). Wide & Deep was the first large-scale demonstration that a joint memorization-generalization architecture outperforms either component alone on industrial recommendation tasks.
DeepFM (Huawei, 2017)
DeepFM (Guo et al., 2017) replaces the hand-engineered wide component with a Factorization Machine (FM) layer that automatically learns all pairwise feature interactions:
\[\hat{y}_{\text{FM}} = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j\]
where \(\mathbf{v}_i \in \mathbb{R}^k\) is a learnable embedding for feature \(i\). The inner product \(\langle \mathbf{v}_i, \mathbf{v}_j \rangle\) captures the interaction between features \(i\) and \(j\) without requiring explicit specification. DeepFM shares the embedding layer between the FM component and the DNN component, allowing both to benefit from the same feature representations while learning different-order interactions.
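The pairwise sum looks quadratic in the number of features, but it can be computed in \(O(nk)\) via the identity \(\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \tfrac{1}{2} \sum_f \big[(\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2\big]\). A numpy sketch (names illustrative), checked against the naive double loop:

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Factorization Machine prediction. x: (n,) features, V: (n, k)
    feature embeddings. Pairwise term uses the O(nk) identity."""
    linear = w0 + w @ x
    s = V.T @ x                       # (k,)  per-factor weighted sums
    s2 = (V ** 2).T @ (x ** 2)        # (k,)  per-factor sums of squares
    pairwise = 0.5 * float((s ** 2 - s2).sum())
    return linear + pairwise

# Sanity check against the explicit O(n^2 k) double loop
rng = np.random.default_rng(1)
n, k = 6, 3
x = rng.random(n)
V = rng.standard_normal((n, k))
naive = sum(V[i] @ V[j] * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))
assert np.isclose(fm_score(x, 0.0, np.zeros(n), V), naive)
```

This linear-time trick is what makes FM layers practical inside models like DeepFM, where \(n\) is the number of feature fields per example.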
DLRM: Deep Learning Recommendation Model (Meta, 2019)
Naumov et al. (2019) at Meta describe the production recommendation architecture used across Facebook and Instagram — one of the most influential systems engineering papers of the last decade. DLRM has four stages:
- Bottom MLP: dense continuous features (age, session duration, device type) are processed through a small MLP to project them into an embedding-compatible space.
- Embedding tables: sparse categorical features (user ID, item ID, advertiser ID, page category) are looked up in large embedding tables to retrieve their dense vector representations. These tables are the dominant memory cost of the system — at Meta’s scale, the embedding tables for a single model span hundreds of gigabytes and must be sharded across thousands of machines.
- Feature interaction layer: all the dense vectors from step 1 and step 2 are concatenated, and explicit pairwise dot products are computed over all pairs. This interaction layer captures second-order feature combinations without requiring a DNN to learn them implicitly.
- Top MLP: the interaction outputs are passed through a final MLP with a sigmoid output for binary click prediction.
The most striking engineering fact about DLRM and its successors at Meta is the scale of the embedding tables. A production model for News Feed recommendation might have embedding tables with billions of entries (one vector per user, per item, per advertiser, per content category), each vector of dimension 64 to 256. At 4 bytes per float and dimension 128, a single table with 1 billion entries consumes 512 GB. Meta’s production systems maintain multiple such tables, totaling several terabytes of embedding parameters — an amount that does not fit on any number of GPUs in a single server rack. The solution is parameter sharding: the embedding tables are partitioned across thousands of machines, and embedding lookups require network communication to retrieve the relevant row from the appropriate shard. The bandwidth cost of embedding lookups, not the FLOP cost of the MLP layers, is the primary bottleneck in production DLRM training. This is why specialized hardware (Meta’s MTIA, Google’s TPU pods) and custom networking infrastructure (InfiniBand, RoCE) are essential for training at this scale.
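The arithmetic quoted above is easy to verify:

```python
def embedding_table_bytes(n_rows, dim, bytes_per_float=4):
    """Memory footprint of one dense fp32 embedding table."""
    return n_rows * dim * bytes_per_float

# The figure in the text: 1 billion entries at dimension 128, fp32
gb = embedding_table_bytes(1_000_000_000, 128) / 1e9
print(f"{gb:.0f} GB per table")   # 512 GB
```

Multiply by a handful of tables per model and the multi-terabyte total — and the need for sharding — follows immediately.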
Transformers for sequential recommendation
The transformer architecture, which had revolutionized NLP (Chapter 3), reached recommendation systems with SASRec (Kang and McAuley, 2018) and BERT4Rec (Sun et al., 2019). Both models treat the user’s interaction history as a sequence and learn to predict the next item using self-attention over the historical items.
SASRec uses unidirectional (causal) attention — attending only to past items — and is trained to predict the next item given all previous items in the session. BERT4Rec uses bidirectional attention (like BERT in NLP) with masked item prediction: randomly mask items in the sequence, and train the model to recover them from the surrounding context. Both models substantially outperform RNN-based sequential recommenders on standard benchmarks because self-attention can attend to items from any position in the history, while RNNs decay information from distant items.
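The mechanism both models rely on fits in a few lines: scaled dot-product self-attention over a sequence of item embeddings, with a causal mask for the SASRec setup. This sketch omits the learned query/key/value projections, multiple heads, and position embeddings — it is an illustration of the masking, not either paper’s full model:

```python
import numpy as np

def causal_self_attention(E):
    """Single-head attention over item embeddings E of shape (T, d).
    The causal mask lets position t attend only to items 0..t (SASRec);
    BERT4Rec drops this mask and instead masks random input items."""
    T, d = E.shape
    scores = E @ E.T / np.sqrt(d)                     # (T, T) logits
    scores[np.triu(np.ones((T, T), dtype=bool), k=1)] = -np.inf
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)                 # rows sum to 1
    return w @ E                                      # contextualized vectors

rng = np.random.default_rng(0)
E = rng.standard_normal((5, 8))   # a 5-item session, embedding dim 8
H = causal_self_attention(E)      # H[0] == E[0]: nothing earlier to attend to
```

Position 0 can attend only to itself, so its output is unchanged; later positions mix in information from any earlier item regardless of distance — the property that lets self-attention beat RNNs on long histories.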
Pinterest’s Pinnerformer (Pancha et al., 2022) extends this idea to long-range user histories at industrial scale: a transformer trained on the full history of user pins (potentially thousands of items over years of usage) to produce a single dense user embedding, which is then used for retrieval. Pinnerformer is notable for demonstrating that a foundation-model-style pretraining approach (large model, large data, general objective) transfers directly to recommendation quality, anticipating the current wave of “foundation recommenders.”
Full implementations of Wide & Deep, DLRM, and SASRec require PyTorch and, for DLRM at any realistic scale, TorchRec (Meta’s distributed recommendation library) or Merlin (NVIDIA’s GPU-accelerated recommendation framework). The blocks below show the architecture code for reference; run them in a Colab GPU environment. They are marked clearly and will not execute here.
# DLRM-style architecture in PyTorch (reference only — requires torch, torchrec)
import torch
import torch.nn as nn

class DLRM(nn.Module):
    def __init__(self, dense_dim, embedding_dims, embedding_sizes,
                 bottom_mlp_dims, top_mlp_dims):
        super().__init__()
        # Bottom MLP: processes dense continuous features
        bottom_layers = []
        in_dim = dense_dim
        for out_dim in bottom_mlp_dims:
            bottom_layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.bottom_mlp = nn.Sequential(*bottom_layers)

        # Sparse embedding tables: one per categorical feature field.
        # All embedding dims must equal bottom_mlp_dims[-1] so the
        # vectors can be stacked for the interaction step.
        self.embeddings = nn.ModuleList([
            nn.EmbeddingBag(n_items, emb_dim, mode='mean', sparse=True)
            for n_items, emb_dim in zip(embedding_sizes, embedding_dims)
        ])

        # Top MLP: processes interaction outputs.
        # Input dim = bottom-MLP output + number of pairwise interactions
        n_fields = len(embedding_dims) + 1  # +1 for bottom MLP output
        n_interactions = n_fields * (n_fields - 1) // 2 + bottom_mlp_dims[-1]
        top_layers = []
        in_dim = n_interactions
        for out_dim in top_mlp_dims:
            top_layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        top_layers.append(nn.Linear(in_dim, 1))
        self.top_mlp = nn.Sequential(*top_layers)

    def forward(self, dense_x, sparse_indices):
        # 1. Bottom MLP on dense features
        dense_out = self.bottom_mlp(dense_x)  # (B, d_dense)
        # 2. Embedding lookups for sparse features
        emb_outs = [emb(idx) for emb, idx in zip(self.embeddings, sparse_indices)]
        all_vecs = [dense_out] + emb_outs  # list of (B, d) tensors
        # 3. Pairwise dot-product interactions
        T = torch.stack(all_vecs, dim=1)  # (B, n_fields, d)
        interactions = torch.bmm(T, T.transpose(1, 2))  # (B, n_fields, n_fields)
        # Upper triangle only (excluding diagonal)
        n = T.shape[1]
        idx = torch.triu_indices(n, n, offset=1)
        interact_flat = interactions[:, idx[0], idx[1]]  # (B, n_pairs)
        # 4. Concatenate and pass through top MLP
        x = torch.cat([dense_out, interact_flat], dim=1)
        return torch.sigmoid(self.top_mlp(x)).squeeze(-1)

Two-Tower Retrieval
The architectural constraint
The ranking models described in the previous section — Wide & Deep, DLRM — are powerful but slow. Scoring a single (user, item) pair through a large DNN takes hundreds of microseconds. If the item catalog has 100 million items and the system must respond in 20 milliseconds, scoring every item with the full ranking model is mathematically impossible: \(100{,}000{,}000 \times 100\,\mu\text{s} = 10{,}000\,\text{seconds}\).
The solution is a two-stage architecture. The retrieval stage uses a fast approximate method to reduce the 100-million-item catalog to a tractable candidate set (typically 500–2000 items). The ranking stage applies the expensive full model only to this small candidate set. The retrieval stage dominates recall; the ranking stage dominates precision. For retrieval to be fast, it must be reducible to a vector similarity search — which is where the two-tower architecture comes in.
The two-tower model
A two-tower model consists of two independent neural encoders: a user tower \(f_u(\cdot; \theta_u)\) that maps user features to a \(d\)-dimensional embedding, and an item tower \(f_i(\cdot; \theta_i)\) that maps item features to a \(d\)-dimensional embedding. The predicted relevance score is the dot product:
\[\hat{y}_{u,i} = f_u(u; \theta_u)^\top f_i(i; \theta_i)\]
The critical architectural choice is that the user and item towers are independent: the user embedding is computed once per user (and can be precomputed and cached), and the item embeddings are computed once per item (and can be indexed in a vector database). At query time, retrieval becomes: embed the user, then find the top-\(K\) items by maximum inner product search (MIPS) over the precomputed item index. MIPS can be solved approximately in \(O(\log m)\) with HNSW or FAISS indices (covered in Chapter 9), making retrieval of the top-1000 candidates from 100 million items feasible in under 10 milliseconds.
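A brute-force numpy stand-in for this pipeline: embed the user, then run maximum inner product search over a precomputed item matrix. In production the dense matrix is replaced by an ANN index (HNSW / FAISS); the random embeddings here are placeholders for the tower outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 32, 100_000

# Item embeddings: computed once by the item tower and indexed offline.
item_emb = rng.standard_normal((n_items, d)).astype(np.float32)

def retrieve(user_emb, k=1000):
    """Exact MIPS by full scan — the baseline an ANN index approximates."""
    scores = item_emb @ user_emb                   # (n_items,) dot products
    top = np.argpartition(-scores, k)[:k]          # unordered top-k, O(n)
    return top[np.argsort(-scores[top])]           # sorted by score

user_emb = rng.standard_normal(d).astype(np.float32)  # user-tower output
candidates = retrieve(user_emb, k=1000)
```

The full scan is \(O(m)\) per query; the point of the two-tower factorization is that, because the score is a plain dot product, this scan can be replaced by an \(O(\log m)\)-ish approximate index without touching the model.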
In-batch negative training
Training a two-tower model requires negative examples — (user, item) pairs where the item was not clicked. The standard approach is in-batch negatives: given a mini-batch of \(B\) (user, positive-item) pairs from observed clicks, treat all other items in the batch as negatives for each user. The sampled softmax loss over a batch is:
\[\mathcal{L}_{\text{two-tower}} = -\frac{1}{B}\sum_{u=1}^{B} \log \frac{\exp\!\left(f_u(u)^\top f_i(i_u^+) / \tau\right)}{\sum_{j=1}^{B} \exp\!\left(f_u(u)^\top f_i(i_j) / \tau\right)}\]
where \(i_u^+\) is the positive item for user \(u\), \(\tau > 0\) is a temperature hyperparameter (typically 0.07), and the sum in the denominator runs over all \(B\) items in the batch (including \(i_u^+\)). This is the InfoNCE / contrastive loss adapted to recommendation. In-batch negatives are efficient — each batch of \(B\) positive pairs yields \(B(B-1)\) negative pairs — but they introduce a popularity bias: popular items appear more frequently in the batch and are therefore more frequently used as negatives, causing the model to learn to down-rank popular items even when they are relevant. Google’s sampling-bias-corrected modeling paper (Yi et al., 2019) proposed a correction that weights each negative by the inverse of its sampling probability; this is now standard practice at large platforms.
YouTube’s 2019 paper “Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations” (Yi et al.) describes the production two-tower system that retrieves candidate videos for the YouTube homepage. The user tower encodes watch history, search history, demographics, and device type; the item tower encodes video content features, metadata, and engagement statistics. Both towers are relatively shallow (2–3 hidden layers, embedding dimension 256), which keeps the per-query latency low. The item embeddings are indexed in a custom MIPS system and updated every few hours as new videos are uploaded and engagement patterns shift. The two-tower model is trained on billions of (user, watched-video) pairs daily. The paper’s main contribution — the sampling-bias correction — was shown to improve recall@1 by roughly 5% on the internal benchmark, which at YouTube’s scale translates to tens of millions of additional satisfying video starts per day.
Live demo: two-tower model with in-batch negatives (numpy)
Before running the cell below, predict: with a batch size of 8 and embedding dimension 16, how many negative pairs does in-batch training generate per batch? If we train for 500 steps on a dataset of 200 synthetic (user, item) positive pairs, how many total gradient steps does the user tower take?
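The demo cell itself is not reproduced here; the following is a condensed numpy sketch of what it computes — synthetic four-group click data, towers reduced to plain embedding tables (ID-only features), and the in-batch sampled-softmax loss from above, with each batch of \(B\) positives supplying \(B(B-1)\) negatives. Plotting code (loss curve, PCA, heatmap) is omitted, and all hyperparameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n_groups, per_group, d, B = 4, 8, 16, 8
tau, lr, steps = 0.2, 0.3, 500
n_users = n_items = n_groups * per_group

# Synthetic positives: each click pairs a user and an item from the same latent group.
user_group = np.repeat(np.arange(n_groups), per_group)
item_group = np.repeat(np.arange(n_groups), per_group)
g = rng.integers(n_groups, size=200)
users = np.array([rng.choice(np.flatnonzero(user_group == gi)) for gi in g])
items = np.array([rng.choice(np.flatnonzero(item_group == gi)) for gi in g])

# "Towers" reduced to embedding tables for brevity.
U = 0.1 * rng.standard_normal((n_users, d))
V = 0.1 * rng.standard_normal((n_items, d))

losses = []
for _ in range(steps):
    b = rng.integers(200, size=B)              # sample a mini-batch of positive pairs
    u_idx, i_idx = users[b], items[b]
    Ub, Vb = U[u_idx], V[i_idx]
    S = Ub @ Vb.T / tau                        # B x B logits; diagonal = positives
    P = np.exp(S - S.max(1, keepdims=True))
    P /= P.sum(1, keepdims=True)               # row-wise softmax
    losses.append(-np.mean(np.log(P[np.arange(B), np.arange(B)])))
    dS = (P - np.eye(B)) / B                   # gradient of the sampled-softmax loss
    np.add.at(U, u_idx, -lr * (dS @ Vb) / tau) # every in-batch item updates both towers
    np.add.at(V, i_idx, -lr * (dS.T @ Ub) / tau)
```

After training, `U @ V.T` should score same-group (user, item) pairs above cross-group pairs, which is the structure the PCA plot and heatmap in the full cell visualize.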
Interpretation. The loss curve confirms that the contrastive objective drives user and item embeddings toward a geometry where clicked pairs have high cosine similarity. The PCA plot is the key diagnostic: after training, users and items from the same latent group are geometrically adjacent — the model has recovered the four-group structure without any explicit group labels, using only the co-occurrence pattern of clicks. The score heatmap shows that the high-scoring cells (orange/red) align with the positive click pairs (blue crosses), confirming that the learned embeddings would support effective retrieval. In production, the item embeddings are indexed in a FAISS HNSW index (Chapter 9); at query time, the user embedding is computed online and the top-1000 nearest items are retrieved in milliseconds.
The Multi-Stage Recommender Pipeline
Why a single model cannot do it all
The tension at the heart of industrial recommendation is between model capacity (the more complex the model, the better it can capture the interaction between user, item, and context) and serving latency (the more complex the model, the longer it takes to score a single candidate). A three-layer MLP that scores a user-item pair in 50 microseconds is fast enough to score 400 candidates in 20 milliseconds. But a DLRM-style model with large embedding lookups might take 500 microseconds per candidate — fast enough for only 40 candidates in the same budget. And a full transformer that attends over the user’s entire history might take 5 milliseconds per candidate — fast enough for only 4 candidates.
The solution that every large platform has converged on is a multi-stage funnel. Each stage trades off model quality against throughput, with progressively more expensive models applied to progressively smaller candidate sets.
Stage 1: Retrieval
The retrieval stage operates over the full item catalog — potentially hundreds of millions of candidates — and must reduce it to a manageable set of roughly 500 to 2000 items within a strict latency budget (typically under 20 milliseconds at p99). Two-tower models with precomputed item embeddings and HNSW approximate nearest-neighbor search are the standard tool. Some platforms run multiple parallel retrieval systems — one two-tower model trained on long-term preferences, one trained on real-time session signals, one based on social graph (items popular among the user’s connections) — and merge their candidate sets before ranking.
The retrieval stage optimizes for recall: it is far more acceptable to retrieve some irrelevant items (which the ranker will discard) than to miss a highly relevant item (which the user will never see). This asymmetry justifies using approximate search methods and relatively simple models at the retrieval stage.
Stage 2: Ranking
The ranking stage applies a large feature-rich model — DLRM, Wide & Deep, or a custom deep learning architecture — to the ~1000 candidates produced by retrieval, scoring each and producing a sorted list of roughly 50 to 200 items. The ranker has access to the full feature set: user history, item content features, context signals, cross-feature interactions. It typically takes 20–100 microseconds per candidate, meaning 1000 candidates can be scored in 20–100 milliseconds — within budget for most systems.
The ranking stage optimizes for precision: given the 1000 candidates that retrieval believes are plausible, find the 50–200 that will actually be clicked, watched, or liked by this user in this context. Training data for the ranker comes from the full click log — the retrieval stage’s output is logged, and actual user interactions provide positive and negative labels for the ranking model.
Stage 3: Re-ranking
The re-ranking stage takes the ranked list from the ranker and applies post-processing rules that cannot be captured by a pure relevance score. Common re-ranking operations include:
- Diversity enforcement: avoid showing the user five consecutive videos from the same creator or the same topic cluster. Maximal Marginal Relevance (MMR) selects items that are relevant to the user and dissimilar from the items already in the list.
- Freshness injection: ensure that recently uploaded or trending content appears in the list even if its engagement history is short (solving a part of the cold-start problem).
- Business constraints: sponsored content must appear at specific positions; certain content categories must meet frequency caps; partner content may receive a position boost under commercial agreements.
- Safety and policy rules: items flagged by content moderation are demoted or suppressed regardless of their predicted relevance score.
The re-ranking stage produces the final top-\(K\) list (typically 10 to 20 items) that is actually shown to the user.
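As a sketch of the diversity step above, here is a minimal greedy MMR re-ranker. The cosine-similarity redundancy penalty and the \(\lambda\) trade-off parameter are the standard MMR formulation; the function name and test data are invented:

```python
import numpy as np

def mmr_rerank(relevance, item_embs, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over a scored candidate list."""
    # Cosine similarity between candidates serves as the redundancy penalty.
    normed = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    sim = normed @ normed.T
    selected, remaining = [], list(range(len(relevance)))
    while remaining and len(selected) < k:
        if selected:  # penalize similarity to anything already picked
            redundancy = sim[np.ix_(remaining, selected)].max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        mmr = lam * relevance[remaining] - (1 - lam) * redundancy
        best = remaining[int(np.argmax(mmr))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Items 0/1 are near-duplicates; MMR picks item 2 over the slightly more
# relevant item 1 because item 1 is redundant with item 0.
ranked = mmr_rerank(np.array([1.0, 0.99, 0.5, 0.4]),
                    np.array([[1, 0], [0.99, 0.01], [0, 1], [0.01, 1]], dtype=float),
                    k=3, lam=0.5)
```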
Spotify’s Discover Weekly playlist — released every Monday morning to over 200 million users — is generated by a multi-stage pipeline that illustrates the full funnel. The retrieval stage uses a collaborative-filtering-based model (essentially a matrix factorization over implicit play data) to identify the universe of candidate tracks that a given user might plausibly enjoy: tracks from artists they follow, tracks played heavily by users with similar taste profiles, tracks recently added by artists whose older work the user has played. This retrieval produces hundreds of thousands of candidates. A ranking stage then scores each candidate using a feature-rich model that incorporates audio features (tempo, energy, danceability derived from Spotify’s audio analysis pipeline), editorial metadata (genre, mood, release decade), and behavioral signals (this user’s historical listening patterns by time of day, device, and listening context). A final curation stage enforces the playlist’s editorial shape: 30 tracks, spanning multiple artists, with a mix of well-known and obscure selections, avoiding consecutive tracks from the same artist. The result, when it works, feels less like an algorithm and more like a recommendation from a friend who knows your taste precisely.
Sequential and Session-Based Recommendation
The recency problem
In the matrix factorization framing, a user is represented by a single static embedding vector that summarizes their entire history. This vector changes slowly as new interactions arrive (through incremental retraining) but does not capture the user’s current intent. What a user wants at 11 PM on a Friday evening, winding down after a long week, is different from what they want at 7 AM on a Monday morning commuting to work — even if the underlying long-term preferences (the type of content they generally enjoy) are the same.
Session-based recommendation addresses this by conditioning the recommendation on the user’s recent action sequence: the items they have interacted with in the current session, typically the last 5 to 50 interactions. The intuition is that the most recent interactions are the strongest signal about current intent, and that intent changes within a session as the user navigates through content.
GRU4Rec: recurrent session models
Hidasi et al. (2016) introduced GRU4Rec, the first successful application of recurrent neural networks to session-based recommendation. The model treats the session as a sequence of item IDs and processes it with a Gated Recurrent Unit (GRU):
\[\mathbf{h}_t = \text{GRU}(\mathbf{e}_{i_t}, \mathbf{h}_{t-1})\]
where \(\mathbf{e}_{i_t} \in \mathbb{R}^d\) is the embedding of the item interacted with at step \(t\), and \(\mathbf{h}_t \in \mathbb{R}^d\) is the hidden state. The recommendation for step \(t+1\) is produced by scoring all items against the current hidden state: \(\hat{y}_{t+1, i} = \mathbf{h}_t^\top \mathbf{q}_i\). GRU4Rec is trained with a pairwise ranking loss (BPR or TOP-1) over mini-batches of sessions.
SASRec: self-attentive sequential recommendation
Kang and McAuley (2018) replaced the GRU with a transformer decoder (causal self-attention). SASRec attends over the full session history rather than compressing it into a fixed-size hidden state, allowing the model to selectively focus on the subset of past items most relevant to predicting the next item:
\[\mathbf{H} = \text{TransformerDecoder}([\mathbf{e}_{i_1}, \mathbf{e}_{i_2}, \ldots, \mathbf{e}_{i_t}])\]
\[\hat{y}_{t+1, i} = \mathbf{H}_t^\top \mathbf{q}_i\]
SASRec consistently outperforms GRU4Rec on public benchmarks and has become the standard transformer baseline for sequential recommendation. Its key advantage over GRU4Rec is the ability to model long-range dependencies: if the user’s behavior in a 50-item session shows a U-turn (sci-fi interest at the start, pivoting to comedy at step 40), SASRec can attend to both ends of the session when generating the step-50 recommendation.
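A stripped-down, single-head version of the causal-attention scoring step makes the mechanics concrete. This is a shape-level sketch with random untrained weights — no feed-forward sublayer, layer norm, positional encodings, or stacking, so it is not SASRec itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, d, t = 50, 16, 10                  # catalog size, embedding dim, session length

E = rng.standard_normal((n_items, d)) / np.sqrt(d)   # item table (also the output q_i)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

session = rng.integers(n_items, size=t)     # toy session of item ids
X = E[session]                              # t x d input sequence

# Single-head causal self-attention.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
logits = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((t, t), dtype=bool), k=1)   # forbid attending to future steps
logits[mask] = -np.inf
A = np.exp(logits - logits.max(1, keepdims=True))
A /= A.sum(1, keepdims=True)                # t x t attention weights, lower-triangular
H = A @ V                                   # t x d hidden states

next_item_scores = H[-1] @ E.T              # score every catalog item for step t+1
```

The causal mask is what lets one forward pass produce a training target at every position: \(\mathbf{H}_t\) sees only items \(1 \ldots t\), so each row can be trained to predict the next item.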
TikTok’s For You Page is unusual among large-platform recommenders in the extreme weight it places on real-time behavioral signals relative to long-term preferences. A user who has watched three fitness videos in the current session will immediately receive more fitness content in their next recommendation — the latency from action to recommendation adjustment is measured in seconds, not hours. This is technically achieved by maintaining a real-time feature store that ingests the user’s last \(N\) interactions and makes them available to the ranking model with sub-second staleness. The two-tower retrieval model runs on a user embedding that is updated every few minutes from this real-time feature store, while the ranking model receives the raw recent interaction sequence directly as input. TikTok’s recommendation team has been reported (based on patent filings and public interviews) to update the user embedding with exponentially decaying weights over the session — interactions from 30 seconds ago receive 10× the weight of interactions from 30 minutes ago. The result is a system that can pivot rapidly from one content cluster to another within a single session, which is part of what makes the FYP experience feel both responsive and, to critics, addictive.
Multi-Objective Optimization
The proliferation of objectives
A production recommender does not optimize a single metric. At any given platform, the engineering team is simultaneously responsible for:
- Click-through rate (CTR): the probability that the user clicks on a recommended item.
- Watch time (video platforms): the total number of seconds the user watches after clicking.
- Long-term engagement: will this user return tomorrow, next week, next month?
- Creator revenue: ensuring that creators who produce high-quality content receive sufficient views to sustain their creation.
- User satisfaction: post-experience ratings, explicit negative feedback signals, unsubscribe rates.
- Platform safety: avoiding recommendations that lead users toward harmful content clusters.
These objectives are not always aligned. Optimizing purely for CTR produces clickbait (high click rate, low watch time, low satisfaction). Optimizing purely for watch time favors long videos over short ones, and may favor emotionally engaging (but ultimately unsatisfying) content. Optimizing for creator revenue without a satisfaction correction favors high-production content from large channels over niche content that might be more relevant for a given user.
Multi-task learning
The dominant technical approach in production systems is multi-task learning: train a single backbone network that shares lower-level representations across all tasks, with separate prediction heads for each objective:
\[\hat{y}_{\text{click}} = \sigma(f_{\text{click}}(\text{shared\_repr}))\]
\[\hat{y}_{\text{watch}} = \text{softplus}(f_{\text{watch}}(\text{shared\_repr}))\]
\[\hat{y}_{\text{like}} = \sigma(f_{\text{like}}(\text{shared\_repr}))\]
The final ranking score combines these predictions with learned or hand-tuned weights:
\[s_{u,i} = w_1 \hat{y}_{\text{click}} + w_2 \hat{y}_{\text{watch}} + w_3 \hat{y}_{\text{like}} - w_4 \hat{y}_{\text{dislike}} - w_5 \hat{y}_{\text{report}}\]
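A toy version of this blend is below. The weights are invented for illustration, not any platform's actual policy values; note also that heads live on different scales (probabilities vs. seconds of predicted watch time), so the weights implicitly encode a unit conversion:

```python
# Hand-tuned policy weights (illustrative values only).
W = {"click": 1.0, "watch": 0.8, "like": 2.0, "dislike": -3.0, "report": -50.0}

def blend_score(heads: dict) -> float:
    """Combine per-task head outputs into a single ranking score s_{u,i}."""
    return sum(W[task] * pred for task, pred in heads.items())

candidates = [
    {"click": 0.30, "watch": 12.0, "like": 0.05, "dislike": 0.01, "report": 0.001},
    {"click": 0.45, "watch": 2.0,  "like": 0.01, "dislike": 0.10, "report": 0.002},
]
# The second candidate has the higher predicted CTR but loses on watch time
# and negative signals — the clickbait pattern the weighted blend is meant to demote.
ranked = sorted(candidates, key=blend_score, reverse=True)
```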
The weights \(\{w_k\}\) encode the platform’s policy priorities: how much is a like worth relative to a click? How much should a user report penalize the ranking score? These weights are set by product and policy teams, not by the ML team, and they change over time as platform priorities evolve.
Pareto optimization and the long-term reward problem
The weighted-sum approach has a critical limitation: it assumes that the objectives can be traded off linearly and that the appropriate trade-off weights are known in advance. In practice, the relationship between short-term engagement and long-term satisfaction is nonlinear and poorly understood.
The deeper technical challenge is that long-term satisfaction is a delayed and noisy reward. A user’s session on Monday influences whether they return on Thursday — but Thursday’s return visit is confounded by dozens of other factors (content quality that week, a friend’s recommendation, a notification about a specific creator). Attributing the Thursday return to Monday’s recommendation requires careful causal reasoning, and the data needed to support that reasoning (large-scale A/B tests run over multi-week horizons) is expensive and slow to collect.
YouTube publicly described in a 2019 blog post that they had deliberately modified their recommendation system to reduce recommendations of what they called “borderline content” — videos that were not clearly policy-violating but that empirically led to progressively more extreme content consumption. This intervention reduced engagement (by some estimates, by a small but measurable amount) but was justified on the grounds that the long-term welfare effect was positive. This is an example of a constrained optimization approach: optimize engagement subject to a constraint that the recommendation does not violate certain behavioral patterns associated with harmful outcomes.
Counterfactual Evaluation and Off-Policy Methods
The logging bias problem
Offline evaluation of a recommender using historical log data faces a fundamental problem: the logs were generated by the previous policy (the recommender that was running when the data was collected). Items that the previous policy showed frequently have many logged interactions; items the policy suppressed have few or none. A new model evaluated on this logged data will appear to perform better on the popular items — not because it is actually better, but because the logged outcomes are more reliable for those items.
Formally: let \(\pi_0\) be the logging policy (the old recommender) and \(\pi_\theta\) be the new model we want to evaluate. The quantity we want is the expected reward under the new policy:
\[V(\pi_\theta) = \mathbb{E}_{i \sim \pi_\theta(u)} [r_{u,i}]\]
but we only observe rewards \(r_{u,i}\) for items \(i\) that were actually shown by \(\pi_0\). Naively computing the average reward on the logged data gives us \(V(\pi_0)\), not \(V(\pi_\theta)\).
Inverse Propensity Scoring (IPS)
The standard correction is Inverse Propensity Scoring (IPS), borrowed from the causal inference literature. For each logged interaction \((u, i, r_{u,i})\), we know the probability that the logging policy showed item \(i\) to user \(u\): \(p_0(i \mid u) = \pi_0(i \mid u)\). The IPS estimator reweights each observation by the inverse of its logging probability:
\[\hat{V}_{\text{IPS}}(\pi_\theta) = \frac{1}{|\mathcal{D}|}\sum_{(u, i, r) \in \mathcal{D}} \frac{\pi_\theta(i \mid u)}{\pi_0(i \mid u)} r_{u,i}\]
This is an unbiased estimator of \(V(\pi_\theta)\) when \(\pi_0(i \mid u) > 0\) for all items that \(\pi_\theta\) might show — the positivity assumption. The IPS estimator can have high variance when \(\pi_\theta\) assigns high probability to items that \(\pi_0\) rarely showed (small denominator), which is precisely the case when the new model differs substantially from the old one.
The doubly-robust (DR) estimator combines IPS with a direct model \(\hat{r}_{u,i}\) (a regression model that predicts the reward from features):
\[\hat{V}_{\text{DR}}(\pi_\theta) = \frac{1}{|\mathcal{D}|}\sum_{(u, i, r) \in \mathcal{D}} \left[\hat{r}_{u,i} + \frac{\pi_\theta(i \mid u)}{\pi_0(i \mid u)} \left(r_{u,i} - \hat{r}_{u,i}\right)\right]\]
The DR estimator is doubly robust: it is unbiased if either the propensity model \(\pi_0\) is correctly specified or the reward model \(\hat{r}_{u,i}\) is correctly specified — a weaker requirement than requiring both. In practice, the DR estimator has lower variance than pure IPS because the direct model absorbs the smooth part of the reward signal; the IPS correction only needs to handle the residual.
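The three estimators are easy to compare on synthetic logged data where the true value of the new policy is known by construction (all distributions below are invented). The logging policy favors low-reward items and the new policy favors high-reward ones, so the naive estimate is badly biased while IPS and DR recover the truth:

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_logs = 5, 200_000
r_true = np.array([0.10, 0.20, 0.30, 0.40, 0.50])   # true expected reward per item
pi0 = np.array([0.50, 0.25, 0.15, 0.07, 0.03])      # logging policy: favors low-reward items
pi_new = np.array([0.03, 0.07, 0.15, 0.25, 0.50])   # new policy: favors high-reward items
V_new = float(pi_new @ r_true)                      # ground-truth value of the new policy

shown = rng.choice(n_items, size=n_logs, p=pi0)     # items the old policy actually showed
reward = rng.random(n_logs) < r_true[shown]         # Bernoulli click outcomes

naive = reward.mean()                               # estimates V(pi0), not V(pi_new)
w = pi_new[shown] / pi0[shown]                      # importance weights
ips = np.mean(w * reward)                           # IPS: unbiased for V(pi_new)

r_hat = r_true + rng.normal(0, 0.05, n_items)       # deliberately imperfect reward model
dr = float(pi_new @ r_hat) + np.mean(w * (reward - r_hat[shown]))
```

`dr` stays centered on `V_new` even though `r_hat` is wrong, because the IPS correction term removes the reward model's bias in expectation — the doubly-robust property in miniature.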
A fundamental assumption of IPS is that you know the logging policy \(\pi_0(i \mid u)\). In well-controlled systems where the recommender is fully deterministic given the features, this is tractable: you can rerun the old model on historical features to recover its scores and derive the propensities. In practice, however, many platforms use stochastic policies (exploration noise is added deliberately for bandit-style learning), change models frequently (the logging policy drifts over time), or log insufficient information to reconstruct the exact item-level probabilities. In these cases, the propensity must be estimated from the data — which introduces its own estimation error and can cause the IPS estimator to be biased in practice. This is an active area of research (doubly-robust learning, self-normalizing IPS, clipped IPS) and an underappreciated source of evaluation error in deployed systems.
The Cold-Start Problem
Three types of cold start
The cold-start problem refers to the difficulty of making good recommendations for entities with little or no interaction history. It manifests in three forms:
New user cold start: a user who has just created an account has no rating history, no behavioral signals, and no session data. Any user-based CF or MF model produces degenerate predictions because the user’s embedding is at initialization (random noise). The platform must either ask for explicit preferences during onboarding (which most users find annoying) or rely on non-personalized signals (current trending content, geographically popular items, demographically segmented defaults).
New item cold start: a newly uploaded video, article, or product has zero interactions. Its item embedding is random noise. A recommendation system that ranks items purely by their learned embedding will never show new items because their predicted scores are unreliable — creating a rich-get-richer dynamic where popular items accumulate engagement and new items never break through.
New context cold start: a returning user in an entirely new context (new device, new location, new time of day) may have sufficient long-term history but insufficient context-specific signals to make accurate context-aware recommendations.
Solutions
Content features are the primary tool for item cold start. Instead of relying on collaborative signals (which require historical interactions), the system uses the item’s observable features: text descriptions, visual features (image/video embeddings), audio features, metadata (category, tags, creator, upload time). A two-tower model with content-rich item towers can produce reasonable item embeddings for brand-new items by passing their features through the trained tower — no interactions required.
Social signals address user cold start: if a new user has connected their social accounts or has friends on the platform, their social graph provides an informative prior. A user whose existing friends all listen to K-pop and indie folk is much more likely to enjoy those genres than a random new user, even before their first interaction.
Exploration bonuses address new-item cold start from the demand side: the recommender deliberately shows new items to a small fraction of users — selected for their diversity of tastes, their tendency to engage with new content, and their history of providing reliable feedback — and observes the engagement signal. This is the explore-exploit trade-off from multi-armed bandit theory.
Thompson Sampling for new items
Thompson sampling is the principled Bayesian approach to the exploration-exploitation trade-off. For each new item \(i\), maintain a Beta posterior over its “true” click-through rate:
\[\text{CTR}_i \sim \text{Beta}(\alpha_i, \beta_i)\]
where \(\alpha_i\) is initialized to 1 (one virtual success) and \(\beta_i\) to 1 (one virtual failure). After each exposure where the item is clicked, \(\alpha_i \leftarrow \alpha_i + 1\); after each non-click, \(\beta_i \leftarrow \beta_i + 1\). At ranking time, instead of ranking by the posterior mean \(\frac{\alpha_i}{\alpha_i + \beta_i}\), sample a CTR estimate from the current posterior and rank by the sample. Items with high uncertainty (few observations) have wide posteriors and occasionally sample high values — injecting them into the list. Items with low uncertainty (many observations) sample reliably near their true CTR. The algorithm naturally transitions from exploration (early, high uncertainty) to exploitation (later, low uncertainty) as observations accumulate.
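A minimal simulation of this procedure (the CTRs and horizon are invented) shows the explore-to-exploit transition — the item with the highest true CTR accumulates the vast majority of exposures:

```python
import numpy as np

rng = np.random.default_rng(7)
true_ctr = np.array([0.02, 0.05, 0.10])   # hidden CTRs of three new items
alpha = np.ones(3)                        # Beta(1, 1) prior: one virtual success...
beta = np.ones(3)                         # ...and one virtual failure per item

shown = np.zeros(3, dtype=int)
for _ in range(20_000):
    sample = rng.beta(alpha, beta)        # one posterior draw per item
    i = int(np.argmax(sample))            # rank by the sampled CTR, not the posterior mean
    shown[i] += 1
    click = rng.random() < true_ctr[i]
    alpha[i] += click                     # posterior update on each exposure
    beta[i] += 1 - click
```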
::: {.callout-warning title="Exploration at scale: unintended consequences"}
Exploration policies that inject new items into the recommendation stream are powerful but carry risks at scale. If 1% of YouTube’s recommendation slots are dedicated to exploring new videos, and YouTube serves 1 billion recommendation events per day, that is 10 million exploration slots. Even a very small fraction of harmful new content in the uploaded corpus — if the content-moderation system misses it before the exploration policy surfaces it — can be shown to millions of users. This is why platforms run content moderation pipelines before any new content is eligible for the exploration pool, and why the exploration fraction is kept small and carefully monitored for safety signals.
:::
Ethical and Long-Run Considerations
The filter bubble hypothesis
The term filter bubble, coined by Eli Pariser (2011), describes the effect of personalization algorithms on the diversity of information a user receives. If a recommender shows users only content that reinforces their existing beliefs and preferences, the resulting information diet may be narrower than what the user would encounter in an unfiltered media environment — potentially contributing to polarization, epistemic isolation, and susceptibility to misinformation.
The empirical evidence on filter bubbles is more nuanced than the popular framing suggests. Bakshy et al. (2015), studying Facebook’s news feed, found that the algorithmic feed did expose users to slightly less ideologically cross-cutting content than a fully random feed — but that the dominant source of ideological homophily was the users’ own choice of which links to click, not the algorithm’s curation. The algorithm’s marginal contribution to the filter bubble was real but smaller than the users’ own selection behavior. Allcott et al. (2020) conducted a large experiment in which participants were paid to deactivate Facebook for four weeks and found that deactivation reduced political polarization and online political activity while modestly increasing subjective well-being — results that are difficult to interpret without careful causal modeling.
Addictive design and engagement maximization
The more direct concern is not filter bubbles but addictive design: the deliberate or inadvertent construction of recommendation systems that maximize time spent on platform, regardless of whether that time generates genuine satisfaction or value. The mechanisms are well understood from behavioral psychology — variable reward schedules, social comparison, fear of missing out — and are structurally embedded in platforms that measure success by DAU (daily active users), session length, and return visit frequency.
The technical path toward systems that optimize for satisfaction rather than engagement is challenging but not impossible. Several approaches are active research areas:
Revealed versus stated preferences: users’ stated preferences (from explicit ratings or surveys) often diverge from their revealed preferences (behavioral signals). A user who watches three hours of outrage-generating political content per day is revealing a behavioral preference but may not be satisfied with the outcome. Systems that give weight to stated preferences — short post-session surveys, explicit “I liked this” vs. “I watched this” signals — appear to produce better long-run retention.
Diversity and serendipity: systems explicitly penalized for serving homogeneous content (measured by cluster entropy of the recommended list) show better long-run engagement in A/B tests, because users who are continuously surprised by relevant new content have more reason to return. This is empirically documented in several published studies from Netflix and Spotify.
YouTube has published that they conduct large-scale surveys asking users to rate their satisfaction with specific recommendations on a five-point scale immediately after watching. These survey responses are used as a direct training signal in a separate “satisfaction model” that runs alongside the engagement model in the ranking stage. The final ranking score blends both: a video that is predicted to get high watch time and high satisfaction rating is ranked above one predicted to get high watch time but low satisfaction. YouTube reported in 2022 that incorporating satisfaction signals had reduced “regretted watch time” — a metric measuring how often users reported wishing they had not watched a video — by a measurable percentage, while not reducing total watch time. The practical lesson for system designers is that the survey signal, though expensive to collect at scale, is more directly aligned with the true objective than any behavioral proxy.
Production Realities — Latency, Throughput, Freshness
Latency requirements
The end-to-end latency budget for a recommendation request is typically 50–200 milliseconds from the user’s scroll action to the rendered content on their screen. Within this budget, the recommender system is allocated roughly 20–50 milliseconds. This breaks down across the pipeline stages:
- Retrieval (two-tower + HNSW ANN search): under 10 milliseconds at p99 for a catalog of 100 million items. FAISS HNSW search on a precomputed item index at this scale, running on a dedicated retrieval server with the index loaded in memory, achieves sub-5ms retrieval of top-1000 candidates.
- Feature serving: given the 1000 candidate item IDs, the system must fetch all features (content features, recent engagement statistics, creator metadata) from a low-latency feature store. This step often takes 5–15ms and is frequently the latency bottleneck in production.
- Ranking (DNN scoring of 1000 candidates): 10–30 milliseconds on a GPU inference server, depending on model complexity. DLRM-style models with embedding lookups are typically the expensive component.
- Re-ranking and business logic: <5ms (in-memory rule application).
Meeting p99 latency requirements (not just average latency) is critical: the 1% of slowest responses typically occur during traffic spikes, which are precisely the moments when the system is under the most load and when latency failures have the highest visibility.
Training at petabyte scale
The training data for a production recommender at TikTok, YouTube, or Instagram scale is enormous. A single day of user interactions on a platform with 1 billion daily active users, each with 100 recommendation events per session, produces 100 billion (user, item, outcome) triples. At 100 bytes per training example this is already 10 terabytes of training data per day; with the full feature vectors materialized — often kilobytes per example — the volume runs into the petabytes.
Continuous training — updating the model on a rolling window of recent data with hourly or daily refresh cycles — is standard. The embedding tables, which are the largest component of the model, are updated by asynchronous SGD on distributed parameter servers: gradient updates from each training worker are pushed to a central parameter server, which applies them with some staleness (the gradient may reflect model parameters from a few seconds ago). The staleness is acceptable because the signal-to-noise ratio in the training data is low and the model changes slowly relative to the update frequency.
Online learning — updating the model in real time as interactions arrive — is the frontier. TikTok has described (in patents and public interviews) a system that updates recommendation model parameters on a 5-minute cycle: a streaming pipeline ingests interactions, computes gradients, and applies them to the online model within minutes of the interaction occurring. This ensures that a viral trend that emerges at 2 PM is incorporated into the recommendation model by 2:05 PM, rather than waiting for the next daily training cycle.
Embedding table sharding
The embedding tables in a large-scale recommender — mapping user IDs, item IDs, and categorical features to dense vectors — are the dominant memory component of the system. At Meta’s scale, a single model might have:
- User ID embedding table: \(10^9\) users × 128 dimensions × 4 bytes = 512 GB
- Item ID embedding table: \(10^8\) items × 128 dimensions × 4 bytes = 51 GB
- Advertiser ID table, page category table, device type table, etc.
No single machine has 512 GB of GPU memory. The solution is table sharding: partition each embedding table row-wise across hundreds of machines, with a consistent hash routing each ID to its responsible shard. An embedding lookup for a batch of training examples requires network communication to multiple shards, and the resulting partial embeddings are gathered and concatenated on the training worker. The bandwidth cost of this all-to-all communication — not the arithmetic cost of the neural network — is the primary bottleneck in DLRM training at scale.
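A sketch of the routing logic is below. The hash scheme and shard count are assumptions for illustration, not Meta's actual design; the point is that a batch's lookups are grouped by destination so each shard receives one batched request rather than one RPC per ID:

```python
import hashlib

NUM_SHARDS = 256   # embedding table partitioned row-wise across this many servers

def shard_for(entity_id: int) -> int:
    """Route an ID to its shard with a stable hash (assumed scheme)."""
    h = hashlib.md5(str(entity_id).encode()).digest()
    return int.from_bytes(h[:8], "big") % NUM_SHARDS

def plan_lookups(batch_ids):
    """Group a batch's IDs by destination shard: one RPC per shard, not per ID."""
    plan = {}
    for eid in batch_ids:
        plan.setdefault(shard_for(eid), []).append(eid)
    return plan

# One training batch's embedding lookups, grouped for the all-to-all exchange.
requests = plan_lookups([101, 202, 303, 404, 505, 101])
```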
Meta’s hardware response to this bottleneck is the MTIA (Meta Training and Inference Accelerator), a custom ASIC designed to maximize embedding lookup throughput and minimize interconnect latency for recommendation workloads. Google has the equivalent in TPU pods with custom interconnects. ByteDance (TikTok’s parent) has a dedicated ML infrastructure team numbering in the hundreds whose primary responsibility is the efficiency of recommendation model training and serving — a useful indicator of how central the recommender is to the business.
Mini Case Study — Designing a Short-Video Recommender
The problem specification
Design a recommender system for a short-video platform with the following scale characteristics:
- Users: 500 million daily active users.
- Item catalog: 100 million videos, with 5 million new videos uploaded per day.
- Request rate: 200 billion recommendation events per day (an average of approximately 2.3 million per second; peak load is higher).
- Latency requirement: 50ms end-to-end at p95; 100ms at p99.
- Objectives: watch time, video completion rate, share rate, follow rate (positive); report rate, skip rate (negative).
Stage 1: Two-tower retrieval
The retrieval system maintains a FAISS HNSW index of all 100 million video embeddings, updated every 15 minutes to incorporate newly uploaded videos and refreshed engagement statistics. The item tower encodes:
- Video content features: a vision model (EfficientNet or similar) embedding of the video thumbnail; an audio model embedding of the audio track; a text embedding of the title, description, and auto-generated captions.
- Engagement features: binned historical CTR, watch-completion rate, and share rate, both for the video as a whole and stratified by viewer demographics.
- Creator features: creator follower count, historical video quality scores, account age.
The user tower encodes:
- Long-term interest vector: aggregated from the last 500 interactions (played videos), weighted by watch completion (a 100% completion is weighted higher than a 10% watch).
- Short-term interest vector: aggregated from the last 5 interactions in the current session, with exponential decay by recency.
- Context features: device type, time of day, day of week, network connection quality.
Both towers produce 256-dimensional L2-normalized embeddings. The retrieval call returns the top-1000 candidates by dot product, in under 8ms at p99.
The model has approximately 200 million parameters across both towers, dominated by the video content feature embeddings. It is retrained daily on the previous 7 days of click data using in-batch negatives with sampling-bias correction.
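The sampling-bias correction is usually the logQ correction: subtract each item's log sampling probability from its logit, so that popular videos, which appear as in-batch negatives far more often than uniform sampling would imply, are not systematically pushed away from every user. A minimal NumPy sketch (batch size, dimension, and temperature are illustrative):

```python
import numpy as np

def in_batch_softmax_loss(user_emb, item_emb, item_log_q, temperature=0.05):
    """In-batch sampled softmax with logQ sampling-bias correction.

    user_emb, item_emb: (B, D) L2-normalized tower outputs; row i of each
    is a positive (user, item) pair, and the other B-1 items in the batch
    act as negatives. item_log_q: (B,) log sampling probability of each
    item; subtracting it de-biases the over-representation of popular
    items among in-batch negatives.
    """
    logits = user_emb @ item_emb.T / temperature        # (B, B) scaled similarities
    logits = logits - item_log_q[None, :]               # logQ correction per candidate
    logits = logits - logits.max(axis=1, keepdims=True) # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                   # positives sit on the diagonal
```

In production, the log sampling probabilities are typically estimated online from item impression frequencies rather than computed exactly.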
Stage 2: DLRM-style ranker
The ranking model scores the 1000 retrieval candidates. Architecture:
- Embedding tables: user ID (500M users × 128 dims at 1 byte per dimension, quantized, = 64 GB), video ID (100M videos × 128 dims = 13 GB), creator ID, category ID, device type. Total embedding table size: approximately 100 GB, sharded across 50 parameter server nodes.
- Dense features: session-level features (session length, recent interaction sequence features), context features, user-video feature crosses.
- Bottom MLP: 3 layers (512, 256, 128 units), processing dense features.
- Feature interaction: pairwise dot products over all embedding vectors and bottom MLP output.
- Top MLP: 3 layers (512 and 256 hidden units, followed by 1 output unit per objective head).
- Multi-task output heads: six heads — one each for watch time, completion, share, follow, report, skip — with task-specific loss functions.
Total model parameters: approximately 500 million in the MLPs and on the order of 100 billion in the embedding tables. The embedding table parameters dwarf the network parameters — this is characteristic of all production-scale recommenders.
Serving: the ranker runs on a dedicated GPU inference cluster. Each ranking call scores 1000 candidates in 25ms at p95 by batching the embedding lookups and MLP forward passes.
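The pairwise dot-product interaction step can be sketched directly (shapes are illustrative; production systems fuse this into optimized embedding-bag and interaction kernels):

```python
import numpy as np

def dlrm_interaction(dense_out, emb_vectors):
    """Pairwise dot-product feature interaction (DLRM-style).

    dense_out: (D,) output of the bottom MLP over dense features.
    emb_vectors: list of (D,) embedding vectors (user ID, video ID, ...).
    Returns the concatenation of dense_out with all pairwise dot products
    among {dense_out} ∪ emb_vectors — the input to the top MLP.
    """
    feats = [dense_out] + list(emb_vectors)
    F = np.stack(feats)                     # (num_features, D)
    gram = F @ F.T                          # all pairwise dot products
    iu = np.triu_indices(len(feats), k=1)   # upper triangle, no self-interactions
    return np.concatenate([dense_out, gram[iu]])

# Example: one dense vector plus four ID embeddings, all 128-dim.
dense = np.zeros(128)
embs = [np.full(128, float(j)) for j in range(4)]
interaction_input = dlrm_interaction(dense, embs)
print(interaction_input.shape)   # (138,) = 128 dense + C(5,2) = 10 pairwise dots
```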
Stage 3: Diversity-aware re-ranker
The re-ranker takes the top-200 candidates from the ranker and applies two modifications:
Diversity enforcement using Maximal Marginal Relevance (MMR). Given the ranked list from the ranker, iteratively select the next video as:
\[i^* = \arg\max_{i \in \mathcal{C} \setminus \mathcal{S}} \left[\lambda \cdot s_i - (1-\lambda) \cdot \max_{j \in \mathcal{S}} \text{sim}(i, j)\right]\]
where \(s_i\) is the ranker score for video \(i\), \(\mathcal{S}\) is the set of already-selected videos, \(\text{sim}(i, j)\) is the cosine similarity between the content embeddings of videos \(i\) and \(j\), and \(\lambda \in [0,1]\) controls the relevance-diversity trade-off (typically \(\lambda = 0.7\) in production). MMR ensures that consecutive videos in the feed are not too similar in content — preventing the user from being stuck in a narrow topic for more than two or three consecutive videos.
Freshness injection: the top 5 of the final 20 slots are reserved for videos uploaded within the last 24 hours that scored in the top 5% of the ranker among new items. This guarantees a minimum throughput of new content into the recommendation stream, addressing the cold-start problem for new videos.
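One way to implement the slot reservation (the specific slot positions and interleaving below are assumptions for illustration; the text only fixes the counts of 5 and 20):

```python
def assemble_feed(ranked_ids, fresh_ids, fresh_slots=(3, 7, 11, 15, 19), list_len=20):
    """Fill a fixed-length feed, reserving `fresh_slots` positions for
    fresh videos (uploaded <24h and in the top 5% of ranker scores among
    new items). Both input lists are assumed best-first.
    """
    feed = [None] * list_len
    fresh_iter = iter([v for v in fresh_ids if v not in ranked_ids[:list_len]])
    for pos in fresh_slots:                  # place fresh picks in reserved slots
        feed[pos] = next(fresh_iter, None)
    backfill = (v for v in ranked_ids if v not in feed)
    for pos in range(list_len):              # fill remaining slots in rank order
        if feed[pos] is None:
            feed[pos] = next(backfill, None)
    return feed
```

A production implementation would also deduplicate against the user's recent watch history and handle the case where too few fresh candidates qualify (here an unfilled reserved slot simply falls back to the ranked list).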
Multi-objective combination
The final ranking score for the re-ranker’s input is:
\[s_i = 3.0 \cdot \hat{y}_{\text{watch}} + 2.0 \cdot \hat{y}_{\text{complete}} + 1.5 \cdot \hat{y}_{\text{share}} + 0.5 \cdot \hat{y}_{\text{follow}} - 5.0 \cdot \hat{y}_{\text{report}} - 1.0 \cdot \hat{y}_{\text{skip}}\]
The weights are set by product policy. The large negative weight on the report head encodes the platform's policy that a user report is the strongest negative signal: at −5.0 it outweighs even the +3.0 watch-time reward, so a modest predicted report probability can cancel the value of an otherwise engaging video.
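As a concrete check of the weighted sum, with hypothetical head outputs for a single candidate:

```python
# Objective weights from the combination formula above.
WEIGHTS = {"watch": 3.0, "complete": 2.0, "share": 1.5,
           "follow": 0.5, "report": -5.0, "skip": -1.0}

def combined_score(preds):
    """Linear multi-objective combination of per-head predictions."""
    return sum(WEIGHTS[k] * preds.get(k, 0.0) for k in WEIGHTS)

# Hypothetical predictions (probabilities / normalized expected values).
preds = {"watch": 0.6, "complete": 0.4, "share": 0.05,
         "follow": 0.01, "report": 0.001, "skip": 0.2}
print(round(combined_score(preds), 3))  # → 2.475
```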
Parameter count and feedback loops
The combined system has approximately:
| Component | Parameters |
|---|---|
| Two-tower retrieval (both towers) | 200 million |
| DLRM ranker MLP layers | 500 million |
| DLRM embedding tables | ~100 billion |
| Total model state | ~100 billion parameters |
This is the same order of magnitude as large language models — but the vast majority is in the embedding tables, not in the neural network computation graph. The compute cost of a forward pass is modest (small MLP layers); the memory and communication cost of embedding lookups is the engineering challenge.
Feedback loop dangers: the system creates a self-referential loop — it recommends videos, users interact with those recommendations, and the interactions become training data for the next model version. Without deliberate intervention, this loop amplifies whatever biases existed in the initial model. Videos that the first model happened to recommend frequently accumulate the most interaction data, making the next model more confident in recommending them, creating a feedback loop that disfavors new creators and systematically underestimates the quality of content the system rarely explores. Preventing this requires active management of the exploration fraction, regular audits of the distribution of content the system recommends, and deliberate diversity incentives in the re-ranking stage — engineering interventions that are easy to describe and hard to maintain consistently across the rapid release cycles typical of a large platform’s ML team.
Live demo: MMR diversity-aware re-ranking
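The demo code itself is not reproduced here; the following is a minimal self-contained sketch of the experiment, using synthetic data (four topics with popularity-correlated ranker scores; λ = 0.7 as in the MMR formula above; all other constants are illustrative) and printing per-topic counts rather than the plots:

```python
import numpy as np

def mmr_select(scores, embeddings, k, lam=0.7):
    """Greedy Maximal Marginal Relevance over L2-normalized embeddings."""
    selected = [int(np.argmax(scores))]
    while len(selected) < k:
        sims = embeddings @ embeddings[selected].T        # (N, |S|) cosine sims
        mmr = lam * scores - (1 - lam) * sims.max(axis=1)
        mmr[selected] = -np.inf                           # never reselect
        selected.append(int(np.argmax(mmr)))
    return selected

# Synthetic catalog: popular topics get systematically higher ranker scores.
rng = np.random.default_rng(42)
topics = rng.integers(0, 4, size=200)                # topic 0 popular, topic 3 niche
centers = np.eye(4)                                  # one direction per topic
emb = centers[topics] + 0.1 * rng.normal(size=(200, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
scores = 1.0 - 0.2 * topics + 0.05 * rng.normal(size=200)

greedy = np.argsort(scores)[::-1][:20]               # pure score-ordered top-20
diverse = mmr_select(scores, emb, k=20)
print("greedy topic counts:", np.bincount(topics[greedy], minlength=4))
print("MMR topic counts:   ", np.bincount(topics[diverse], minlength=4))
```

Greedy selection concentrates on the highest-scoring topic, while the MMR similarity penalty forces the selection to spread across topics, which is the effect the interpretation below describes.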
Interpretation. The topic distribution comparison (left panel) is the clearest indicator of MMR’s effect. The greedy selection over-represents Topic 0 (the most popular topic, whose videos tend to have higher ranker scores) and completely excludes Topic 3 (the niche topic). MMR’s selection includes at least two videos from each topic — the penalty for similarity to already-selected items counteracts the ranker’s implicit bias toward popular topics. The ranker score comparison (center) shows that MMR accepts a modest score reduction (the average score falls by roughly 5–8%), which is the explicit cost of the diversity constraint. The pairwise similarity heatmap (right) shows low off-diagonal values in the MMR selection — the selected videos are genuinely dissimilar in content space, confirming that the algorithm has successfully spread coverage across the content landscape.
Closing — Recommenders as the Defining Application of AI
The recommender system is not one application of machine learning among many. It is the application that most directly mediates the relationship between artificial intelligence and human attention at scale. Every methodological development in the field — from collaborative filtering in the 1990s, to matrix factorization in the 2000s, to deep learning in the 2010s, to transformers and foundation models now — has found its largest-scale deployment in recommendation. The systems that result are, collectively, the most impactful software ever built: they determine, for billions of people, which news stories they read, which videos they watch, which products they buy, which creators they follow, and — through all of these — what they believe about the world.
This chapter has covered the technical arc from the mathematical foundations (collaborative filtering, matrix factorization, BPR) through the modern production architecture (two-tower retrieval, DLRM ranking, MMR re-ranking, multi-objective optimization) to the hard problems that remain open (counterfactual evaluation, long-term satisfaction, cold start, ethical design). Three themes run through the entire discussion.
The gap between what we measure and what we want. Every recommendation system in production optimizes a proxy — clicks, watch time, shares — for the true objective, which is something like “long-term user flourishing” or “efficient matching of content to interest.” The gap between the proxy and the objective is not merely a technical inconvenience; it is the source of most of the harmful dynamics that critics of algorithmic curation have documented. Closing this gap — building systems whose measured objectives are genuinely aligned with human welfare — is both the most important open problem in recommendation research and the most politically and commercially contested.
The convergence of the stack. A modern production recommender consumes every layer of the modern AI stack: raw behavioral data (Chapter 1), text and content embeddings (Chapters 3–5), foundation models for content encoding (Chapter 9), graph structure for social signals (Chapter 8), and vector databases for fast retrieval (Chapter 9). Understanding the recommender requires understanding all of these layers, which is why it belongs at the end of this book rather than the beginning.
The frontier: foundation-model-native recommendation. The next generation of recommenders will not use foundation models as feature extractors for a downstream ranking model. They will be foundation models themselves — trained end-to-end on the multi-modal content streams of the platform, capable of generating item recommendations as natural language rather than ranked ID lists, and capable of reasoning explicitly about user preferences rather than inferring them from behavioral signal alone. Early prototypes (P5, M6-Rec, GPT4Rec) have demonstrated that a language model can be fine-tuned to generate recommendation lists as text, with competitive or superior accuracy relative to traditional ranking models on standard benchmarks. The shift to generative recommendation is not imminent at the scale of TikTok or YouTube — the serving costs are prohibitive with current transformer architectures — but the direction is clear. The recommender of 2030 will be substantially more capable of explaining its recommendations, negotiating with users about their preferences, and reasoning about the long-term consequences of what it surfaces. Whether it will also be more aligned with human welfare is a question that depends as much on the incentives of the organizations that build it as on the technical properties of the models themselves.
Prof. Xuhu Wan · HKUST · Modern AI Stack for Social Data · 2026 Edition