
Chapter 9: Vector Databases and Embedding Infrastructure

About This Chapter

Every chapter until now has treated the embedding as an endpoint: you pass text through a transformer encoder, collect the dense vector that emerges, and use it as a feature for some downstream classifier or generative model. That framing is correct as far as it goes. But it omits the most operationally consequential question: once you have a billion embeddings, how do you find the one that is most similar to a new query in under fifty milliseconds?

The answer is a vector database — a storage and retrieval system purpose-built for high-dimensional floating-point vectors. The category barely existed in 2019. By 2025, Pinecone had raised $138 million at a $750 million valuation, Weaviate had raised $68 million at approximately $200 million, and Chroma had raised $20 million at seed. More telling than the venture numbers: pgvector, a free PostgreSQL extension that adds a vector similarity operator to ordinary SQL, became the fastest-growing PostgreSQL extension of 2024–2025, with adoption accelerating alongside every new RAG (Retrieval-Augmented Generation) deployment. The reason for all of this investment and adoption is simple: embeddings are useless without a way to find the nearest neighbor at scale.

This chapter builds the full picture from the ground up. We begin with the mathematical statement of the nearest-neighbor problem and the performance gap that the entire industry exists to close. We define the three standard distance metrics and show — with live demonstrations — how the choice of metric changes which items are retrieved. We state the curse of dimensionality result and show empirically how it breaks classical indexing. We then develop, in full mathematical and algorithmic detail, the two dominant families of approximate nearest-neighbor (ANN) algorithms: HNSW (Hierarchical Navigable Small World), which is the graph-based method behind Faiss, Qdrant, and Weaviate’s default index; and IVF-PQ (Inverted File Index with Product Quantization), which is the billion-scale workhorse used in Faiss-GPU and Milvus. We survey the production vector database landscape, cover hybrid search and metadata filtering, and close with a fully worked architectural case study: building a semantic search system for 100 million tweets.

The trajectory of the industry is clear. In the 2000s, the default infrastructure question for a new application was “which relational database?” In the 2010s, it was “which NoSQL store?” In the 2020s, it is “which vector database?” — and by the end of this chapter, you will be equipped to answer that question with engineering precision rather than vendor marketing.

Reference

This chapter builds on the embedding foundations introduced in Chapters 3–5 of this book (transformer encoders, sentence embeddings, and multimodal encoders) and assumes familiarity with cosine similarity and the pretraining/fine-tuning paradigm. Key external references are: Malkov & Yashunin (2018), “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” (IEEE TPAMI); Jégou et al. (2011), “Product quantization for nearest neighbor search” (IEEE TPAMI); Beyer et al. (1999), “When is ‘nearest neighbor’ meaningful?”; Johnson et al. (2021), “Billion-scale similarity search with GPUs” (Faiss paper, IEEE Trans. Big Data). Non-executing Python blocks require faiss-gpu, pinecone-client, or psycopg2 with the pgvector extension, and are clearly marked as such. All live cells run in the browser using only numpy, scipy, scikit-learn, pandas, and matplotlib.


Table of Contents

  1. The Nearest-Neighbor Problem
  2. Distance Metrics
  3. The Curse of Dimensionality
  4. Approximate Nearest Neighbor — the Tradeoff
  5. HNSW: Hierarchical Navigable Small World
  6. IVF: Inverted File Index
  7. Product Quantization
  8. GPU-Accelerated ANN
  9. Production Vector Databases
  10. Hybrid Search — Vector + Keyword + Filter
  11. Metadata Filtering
  12. Embedding Drift and Re-Indexing
  13. Operational Realities
  14. Mini Case Study — Semantic Search for Social Media
  15. Closing — Vector DBs as Infrastructure

The Nearest-Neighbor Problem

Statement of the problem

The nearest-neighbor problem is stated simply. Given a corpus of \(N\) vectors \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N \in \mathbb{R}^d\) stored in an index, and a query vector \(\mathbf{q} \in \mathbb{R}^d\), find the \(k\) vectors in the corpus that are closest to \(\mathbf{q}\) under a specified distance function \(\text{dist}(\cdot, \cdot)\). The result is an ordered list of \(k\) indices:

\[\text{NN}_k(\mathbf{q}) = \underset{S \subseteq [N],\; |S|=k}{\arg\min} \sum_{i \in S} \text{dist}(\mathbf{q}, \mathbf{v}_i)\]

This is well-defined for any metric. The difficulty is not the definition; it is the scale.

The performance gap

A naive implementation scans every vector in the corpus, computes the distance to the query, and returns the top \(k\) by sorting. This exact linear scan requires \(N\) distance evaluations, each costing \(O(d)\) floating-point operations. Total cost per query: \(O(Nd)\).
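The exact linear scan is worth having as a reference implementation, since it is the ground truth against which every ANN method is measured. A minimal numpy sketch (random data stands in for real embeddings; the sizes here are illustrative, not the billion-scale figures below):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k = 100_000, 128, 10
corpus = rng.standard_normal((N, d)).astype(np.float32)
query  = rng.standard_normal(d).astype(np.float32)

# O(Nd): one distance evaluation per corpus vector
dists = np.linalg.norm(corpus - query, axis=1)

# Top-k: argpartition finds the k smallest in O(N), then sort only those k
top_k = np.argpartition(dists, k)[:k]
top_k = top_k[np.argsort(dists[top_k])]
print(top_k, dists[top_k])
```

The `argpartition` step avoids a full \(O(N \log N)\) sort, but the \(O(Nd)\) distance pass remains: that pass is what every ANN index in this chapter is designed to avoid.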

At the scale of a production social media analytics system:

| Parameter | Value |
| --- | --- |
| Corpus size \(N\) | \(10^9\) (one billion embeddings) |
| Embedding dimension \(d\) | 768 (BERT-base, multilingual-e5) |
| FLOPs per distance evaluation | 768 multiply-adds = 1,536 FLOPs |
| Total FLOPs per query | \(1.536 \times 10^{12}\) (~1.5 TFLOPs) |
| A100 GPU throughput | ~312 TFLOP/s (FP16) |
| Minimum latency (theory) | ~5 ms |
| Realistic latency (memory bandwidth) | 500 ms – 5 s |

The bottleneck is not arithmetic throughput; it is memory bandwidth. A 1B-vector index at \(d = 768\) in FP32 requires 3 TB of raw data — far exceeding GPU memory. Streaming 3 TB from disk or DRAM to compute one query incurs tens of seconds of latency, making exact linear scan incompatible with any interactive or real-time application.

The entire field of approximate nearest-neighbor (ANN) search exists to eliminate this gap. ANN algorithms achieve query latencies of 1–10 milliseconds on billion-scale indexes by sacrificing a small amount of recall accuracy — typically 5–15% of the true nearest neighbors are missed — in exchange for orders-of-magnitude faster retrieval.


Distance Metrics

The three primary metrics

Three distance functions dominate practical vector search. Each encodes a different geometric assumption about what “close” means.

Euclidean distance (\(L_2\)) measures the straight-line distance between two points in \(\mathbb{R}^d\):

\[d_{\text{Euclidean}}(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}\]

Euclidean distance is magnitude-sensitive: a vector with large norm will be “far” from all others regardless of its direction. It is the natural choice when the absolute scale of the embedding carries semantic information — for example, in certain image embedding spaces where brightness and contrast are encoded in vector magnitude.

Cosine similarity measures the angle between two vectors, ignoring magnitude:

\[\text{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} = \frac{\sum_{i=1}^d u_i v_i}{\sqrt{\sum_i u_i^2}\sqrt{\sum_i v_i^2}}\]

Cosine similarity lies in \([-1, 1]\); the corresponding cosine distance is \(1 - \text{cos\_sim}(\mathbf{u}, \mathbf{v})\), lying in \([0, 2]\). Cosine similarity is the dominant metric for text embeddings: two sentences with the same meaning but one expressed in a long paragraph and one in a short phrase will have very different \(L_2\) distances but near-identical cosine similarity. Modern sentence encoders (SBERT, multilingual-e5, BGE) are explicitly trained to maximise cosine similarity between semantically equivalent sentences.

Dot product (inner product) similarity:

\[\text{dot}(\mathbf{u}, \mathbf{v}) = \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^d u_i v_i\]

Note that cosine similarity is simply the dot product after \(L_2\)-normalisation: if \(\hat{\mathbf{u}} = \mathbf{u}/\|\mathbf{u}\|_2\) and \(\hat{\mathbf{v}} = \mathbf{v}/\|\mathbf{v}\|_2\), then \(\text{cos\_sim}(\mathbf{u}, \mathbf{v}) = \hat{\mathbf{u}} \cdot \hat{\mathbf{v}}\). For this reason, when all corpus vectors are \(L_2\)-normalised at index time (a free \(O(d)\) operation), cosine search and dot-product search become identical. Most production deployments normalise at ingestion and use dot-product internally, because dot-product can be expressed as a single matrix-vector multiplication — making it amenable to batched GPU acceleration.
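Both identities in this paragraph, cosine-equals-dot-after-normalisation and the unit-vector link between \(L_2\) and cosine, can be checked numerically in a few lines (random vectors as a stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.standard_normal(768), rng.standard_normal(768)

cos_sim = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Normalise once at "index time"; cosine then reduces to a dot product
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)
assert np.allclose(cos_sim, u_hat @ v_hat)

# On unit vectors, squared L2 distance is a monotone function of cosine:
# ||u_hat - v_hat||^2 = 2 - 2 * cos_sim
assert np.allclose(np.sum((u_hat - v_hat) ** 2), 2 - 2 * cos_sim)
print(f"cosine similarity: {cos_sim:.4f}")
```

The second identity is why an index built with an \(L_2\) metric over unit-normalised vectors returns exactly the same ranking as a cosine index.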

Other metrics. Manhattan (\(L_1\)) distance \(\sum_i |u_i - v_i|\) is occasionally used for sparse embeddings. Hamming distance — the number of bit positions that differ — applies to binary quantized vectors and is computed via popcount instructions at speeds 8–32× faster than floating-point operations, at the cost of quantization error. We will return to binary and quantized representations in the Product Quantization section.

When to use which metric

The practical decision tree is:

  1. Are embeddings produced by a model that was trained with a cosine or dot-product objective (virtually all modern sentence encoders)? Use cosine (or equivalently, normalise and use dot product).
  2. Are embeddings dense features from a classical vision model where magnitude carries information? Use \(L_2\).
  3. Are you working with binary hash codes or very high-dimensional sparse binary vectors? Use Hamming.
  4. Is your index backed by matrix multiplication hardware (GPU, TPU) and are vectors normalised? Use dot product directly.

In practice: OpenAI’s vector store

OpenAI’s Assistants API includes a native vector store (released 2024) that indexes file embeddings using OpenAI’s text-embedding-3-small (1536-dim) and text-embedding-3-large (3072-dim) models. The API normalises all embeddings to unit \(L_2\) norm at ingestion and retrieves by cosine similarity — effectively dot product on the normalised vectors. This is the production decision made by one of the world’s largest-scale embedding deployments: normalise everything, use dot product internally, expose cosine semantics to users. For practitioners building RAG pipelines, this is the safe default unless your specific embedding model has been trained with an explicit \(L_2\) objective.

Live demo: four metrics on a toy corpus

The cell below constructs 15 short toy “documents” as random vectors in \(\mathbb{R}^8\) and computes all four distance metrics between a query and the corpus. It then shows how the top-3 nearest neighbors differ by metric — concretely illustrating why metric choice is a modelling decision, not an implementation detail.
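A minimal stand-in for that cell, with random vectors of varied norm playing the role of document embeddings (the scaling factor is an assumption chosen to make magnitude-sensitive and magnitude-blind metrics disagree):

```python
import numpy as np

rng = np.random.default_rng(7)
corpus = rng.standard_normal((15, 8)) * rng.uniform(0.5, 3.0, (15, 1))  # varied norms
query  = rng.standard_normal(8)

euclid = np.linalg.norm(corpus - query, axis=1)
manhat = np.abs(corpus - query).sum(axis=1)
dot    = corpus @ query
cos    = dot / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

top3 = {
    "euclidean": set(np.argsort(euclid)[:3]),
    "manhattan": set(np.argsort(manhat)[:3]),
    "dot":       set(np.argsort(-dot)[:3]),   # larger = more similar
    "cosine":    set(np.argsort(-cos)[:3]),
}
for name, s in top3.items():
    print(f"{name:>9}: top-3 = {sorted(s)}")

# Jaccard overlap between metric pairs quantifies their disagreement
names = list(top3)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = top3[names[i]], top3[names[j]]
        print(f"J({names[i]}, {names[j]}) = {len(a & b) / len(a | b):.2f}")
```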

Interpretation. The rank heatmap reveals that corpus items which are closest under Euclidean distance are often not the closest under cosine distance — a direct consequence of ignoring vector magnitude. The Jaccard similarities between the top-3 sets quantify this disagreement: for randomly distributed vectors, agreement is roughly 0.2–0.5 across metric pairs. In production, this means choosing the wrong metric can silently degrade retrieval quality: your system returns syntactically similar but semantically distant documents, or vice versa. Always verify that your distance metric matches the objective used to train the embedding model.


The Curse of Dimensionality

The formal result

The curse of dimensionality, in the context of nearest-neighbor search, was formalised by Beyer et al. (1999) in the paper “When is ‘nearest neighbor’ meaningful?” The core theorem concerns the ratio of the maximum to minimum distance from a query point to a random corpus in \(\mathbb{R}^d\). Let \(D_{\max}^{(d)}\) and \(D_{\min}^{(d)}\) denote the maximum and minimum \(L_2\) distances from a fixed query to \(N\) i.i.d. random corpus points in \(d\) dimensions. The theorem states that under mild conditions on the data distribution:

\[\lim_{d \to \infty} \frac{D_{\max}^{(d)} - D_{\min}^{(d)}}{D_{\min}^{(d)}} \to 0\]

In other words, as dimensionality grows, all distances become increasingly similar. The nearest neighbor and the farthest neighbor are approximately equidistant from the query. The relative contrast — the signal that nearest-neighbor search relies upon — vanishes.

Why this breaks classical indexes

Classical spatial indexing structures — k-d trees (Bentley, 1975) and ball trees (Omohundro, 1989) — exploit the fact that in low dimensions, a bounding hyperrectangle (for k-d) or a bounding hypersphere (for ball trees) can prune large fractions of the corpus without examining individual points. A query ball that intersects few bounding volumes leads to fast search: instead of examining all \(N\) points, the tree examines only \(O(\log N)\).

This pruning works when bounding volumes are geometrically tight and when a query’s nearest-neighbor ball is small relative to the bounding volumes. Both properties fail in high dimensions. The nearest-neighbor ball of a query point in \(d = 768\) must be expanded to include nearly the entire corpus before encountering the true nearest neighbor, because all points are approximately equidistant. The k-d tree degenerates to a linear scan above approximately \(d \approx 20\).

This is not a flaw in k-d trees; it is a fundamental property of high-dimensional geometry. The entire ANN literature — HNSW, IVF, LSH, ScaNN — exists because of this failure. Every ANN method replaces the geometric pruning of k-d trees with a different strategy: graph navigation (HNSW), vector space partitioning (IVF), or locality-sensitive hashing (LSH).

Before running the cell below, form a quantitative prediction: how quickly should the ratio \((D_{\max} - D_{\min}) / D_{\min}\) decay toward 0 as \(d\) grows, and at what dimensionality do you expect k-d tree search to become effectively \(O(N)\)? Commit to numbers before examining the simulation output.
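A compact stand-in for that live cell (uniform random data; the exact numbers depend on the data distribution, but the trend does not):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000

print(f"{'d':>6} {'(Dmax-Dmin)/Dmin':>18}")
ratios = {}
for d in [2, 8, 32, 128, 512, 1000]:
    pts   = rng.uniform(size=(N, d))       # N random corpus points in [0,1]^d
    q     = rng.uniform(size=d)            # one random query
    dists = np.linalg.norm(pts - q, axis=1)
    ratios[d] = (dists.max() - dists.min()) / dists.min()
    print(f"{d:>6} {ratios[d]:>18.3f}")
# The relative contrast shrinks toward 0 as d grows: nearest and farthest
# neighbors become nearly equidistant from the query.
```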

Interpretation. The simulation makes the curse concrete. At \(d = 2\), the nearest and farthest neighbors differ by a factor of roughly 5–15 in distance — there is a strong signal telling you which corpus point to return. At \(d = 1000\), that ratio collapses toward 1.1 or lower: the nearest and farthest neighbors are nearly indistinguishable. The right panel shows the distribution of distances concentrating around its mean, the variance shrinking to near zero. This is not a numerical artifact; it is a direct consequence of the law of large numbers applied across dimensions. The practical implication: any search method that relies on measuring how much closer the nearest neighbor is than its competitors — which is exactly what k-d trees and ball trees do — will fail in the dimensionalities (\(d = 384, 768, 1024, 3072\)) used by modern embedding models.


Approximate Nearest Neighbor — the Tradeoff

The three-way tradeoff

Every ANN algorithm navigates a three-way tradeoff among recall, latency, and memory. Understanding this tradeoff is the first analytical skill required to select and configure a vector database.

Recall@\(k\) is the fraction of the true \(k\) nearest neighbors that are returned by the approximate method:

\[\text{Recall@}k = \frac{|\text{ANN}_k(\mathbf{q}) \cap \text{NN}_k(\mathbf{q})|}{k}\]

where \(\text{NN}_k(\mathbf{q})\) is the exact top-\(k\) set and \(\text{ANN}_k(\mathbf{q})\) is the approximate result. Recall@10 = 0.90 means the ANN method returns 9 of the true 10 nearest neighbors. Perfect recall (1.0) is achieved only by exact search; practical production systems target Recall@10 \(\geq 0.90\), with high-precision applications (drug discovery, legal search, financial compliance) demanding Recall@10 \(\geq 0.97\).
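The metric is a three-line function over the two ID lists (the IDs below are hypothetical, chosen so that the approximate result misses exactly one true neighbor):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k returned by the approximate method."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

exact  = [4, 17, 3, 52, 9, 31, 8, 77, 12, 60]   # ground-truth top-10 (hypothetical)
approx = [4, 17, 3, 52, 9, 31, 8, 77, 12, 99]   # ANN result missing one true neighbor

print(recall_at_k(approx, exact, k=10))   # 9 of 10 true neighbors found -> 0.9
```

Note that recall is order-insensitive: returning the right 10 items in the wrong order still scores 1.0, which is why ranking quality is usually evaluated separately.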

Latency is the wall-clock time per query (single-threaded) or the p99 latency under concurrent load. Production targets vary: a semantic search bar demands p99 \(< 50\) ms; a batch re-ranking pipeline tolerates 500 ms; a real-time recommendation system may demand p99 \(< 5\) ms. Latency is controlled primarily by the ANN algorithm’s search budget — how much of the index it examines before stopping.

Memory is the RAM required to hold the index. Exact indexes store full-precision vectors (\(32d\) bits per vector, or \(16d\) in FP16). ANN indexes compress vectors (Product Quantization reduces this to \(m \cdot \log_2 k_{\text{sub}}\) bits) and add graph or partition structure. At \(N = 10^9\), even a 4-bit quantized index of \(d = 768\) requires \(\frac{768 \times 4}{8} \times 10^9 = 384\) GB — distributed RAM.

The tradeoff is irreducible: improving recall requires examining more of the index (higher latency) or storing more precise representations (higher memory). Every ANN algorithm exposes hyperparameters that slide the operating point along this surface. Understanding which parameters move you in which direction is the practical skill this chapter develops.

In practice: Spotify’s Annoy legacy

Before HNSW became dominant (post-2018), Spotify’s recommendation system relied on Annoy (Approximate Nearest Neighbors Oh Yeah), an open-source library built by Erik Bernhardsson at Spotify in 2013. Annoy uses random projection trees: at each node, a hyperplane through two random data points splits the space; a query descends multiple trees and takes the union of leaf candidates. Annoy requires zero RAM beyond the data (it memory-maps the tree file), making it practical on servers with limited RAM per CPU. Spotify ran Annoy in production for their music recommendation pipeline from approximately 2013 to 2020, serving hundreds of millions of users. HNSW’s arrival — offering higher recall at lower query latency — eventually triggered Spotify’s migration. The lesson: the “best” ANN algorithm is the one whose tradeoffs match your operational constraints, and those constraints change as your system scales.


HNSW: Hierarchical Navigable Small World

Motivation: small-world graphs

The Hierarchical Navigable Small World algorithm (Malkov & Yashunin, 2018, IEEE TPAMI) draws its inspiration from the small-world phenomenon in social networks: in a graph with \(N\) nodes, if each node has a mix of local connections (to nearby nodes) and long-range connections (to distant nodes), any two nodes can be connected by a path of length \(O(\log N)\). Navigating such a graph greedily — at each step, move to the neighbor closest to the target — finds the destination quickly.

HNSW constructs a multi-layer graph over the corpus vectors. The bottom layer (layer 0) contains every corpus vector: each is a node connected to approximately its \(M\) nearest neighbors. Higher layers are increasingly sparse subgraphs: each vector is present in layer \(\ell\) with probability exponentially decreasing in \(\ell\), controlled by a level assignment parameter \(m_L\):

\[\ell_i = \lfloor -\ln(U_i) \cdot m_L \rfloor, \quad U_i \sim \text{Uniform}(0, 1)\]

where \(m_L = 1/\ln(M)\) ensures that the expected number of elements at level \(\ell\) decreases geometrically. The expected fraction of vectors present at level \(\ell\) is \(e^{-\ell/m_L}\).
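This geometric decay is easy to verify by sampling the level distribution (a quick sketch with \(M = 16\); note \(e^{-\ell/m_L} = M^{-\ell}\)):

```python
import numpy as np

rng = np.random.default_rng(0)
M   = 16
m_L = 1 / np.log(M)

# Draw one level per "vector" from the HNSW level distribution
levels = np.floor(-np.log(rng.uniform(size=1_000_000)) * m_L).astype(int)

for lvl in range(levels.max() + 1):
    frac     = (levels >= lvl).mean()   # empirical fraction present at this level
    expected = np.exp(-lvl / m_L)       # = M ** -lvl: geometric decay
    print(f"level >= {lvl}: empirical {frac:.6f}, expected {expected:.6f}")
```

With \(M = 16\), only one vector in sixteen reaches layer 1, one in 256 reaches layer 2, and so on: the upper layers form the sparse "highway network" described below.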

Insertion algorithm

To insert a new vector \(\mathbf{v}\) into the HNSW index:

  1. Draw the maximum level \(\ell_v\) from the level distribution above.
  2. Starting from the entry point of the topmost layer, greedily descend through layers \(L_{\max}, L_{\max}-1, \ldots, \ell_v + 1\), at each layer finding the single nearest neighbor and using it as the entry point for the next layer.
  3. At layers \(\ell_v, \ell_v - 1, \ldots, 0\), perform a beam search with beam width \(\texttt{ef\_construction}\): maintain a dynamic list of the \(\texttt{ef\_construction}\) closest candidates found so far; expand each candidate’s neighbors; update the list.
  4. From the beam search results, select the \(M\) nearest neighbors and add bidirectional edges. Apply a heuristic pruning step that prefers edges that “spread out” the neighborhood (the key innovation over naive \(k\)-NN graph construction).

The ef_construction parameter controls the trade-off between index quality (higher ef_construction = better edges, higher recall) and build time (linear in ef_construction). Typical production values: M = 16, ef_construction = 200.

Query algorithm

To query for the \(k\) nearest neighbors of \(\mathbf{q}\):

  1. Enter the graph at the stored entry point at the topmost layer.
  2. Greedily descend through layers \(L_{\max}, \ldots, 1\), finding the nearest neighbor at each layer and using it as the entry point for the next.
  3. At layer 0, run beam search with beam width \(\texttt{ef\_search} \geq k\): expand the frontier of the \(\texttt{ef\_search}\) best candidates found so far, stopping when the frontier cannot be improved.
  4. Return the top \(k\) from the beam search result set.

The ef_search parameter is the primary lever for the recall/latency tradeoff at query time: ef_search = k gives maximum speed (lowest recall); ef_search = 1000 approaches exact search (high recall, high latency). This can be adjusted without rebuilding the index.

Complexity. HNSW achieves expected query complexity \(O(\log N)\) for the upper-layer descent and \(O(\texttt{ef\_search} \cdot M)\) for the base-layer beam search. At \(N = 10^9\), \(M = 16\), and ef_search = 100, this is roughly \(O(\log 10^9 + 1600) \approx O(1630)\) operations — versus \(O(10^9)\) for linear scan: an improvement of nearly six orders of magnitude.

Live demo: HNSW from scratch

The cell below implements a minimal two-layer HNSW on 200 random vectors in \(\mathbb{R}^{32}\). It inserts all vectors, executes a query, traces the navigation path, and computes Recall@10 against brute-force exact search.
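A compact stand-in for that cell. The sketch below simplifies further: it collapses the hierarchy to a single navigable layer (NSW-style) and omits the edge-pruning heuristic, but it implements the two operations that matter — insertion by beam search and query by beam search — and measures Recall@10 against brute force:

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
N, d, M, ef = 200, 32, 8, 32
data = rng.standard_normal((N, d)).astype(np.float32)

def qdist(q, i):
    return float(np.linalg.norm(q - data[i]))

graph = {0: set()}   # layer-0 adjacency list, built incrementally

def beam_search(q, entry, ef):
    """Best-first graph search; returns (distance, node) pairs, nearest first."""
    visited = {entry}
    cand = [(qdist(q, entry), entry)]    # min-heap: expansion frontier
    best = [(-qdist(q, entry), entry)]   # max-heap: ef best candidates so far
    while cand:
        d_c, c = heapq.heappop(cand)
        if len(best) >= ef and d_c > -best[0][0]:
            break                        # frontier can no longer improve the result
        for nb in graph[c]:
            if nb not in visited:
                visited.add(nb)
                d_n = qdist(q, nb)
                if len(best) < ef or d_n < -best[0][0]:
                    heapq.heappush(cand, (d_n, nb))
                    heapq.heappush(best, (-d_n, nb))
                    if len(best) > ef:
                        heapq.heappop(best)
    return sorted((-d_b, i) for d_b, i in best)

# Insertion: beam-search from node 0, connect to the M nearest found
for i in range(1, N):
    near = beam_search(data[i], 0, ef)[:M]
    graph[i] = {j for _, j in near}
    for _, j in near:
        graph[j].add(i)                  # bidirectional edges (no pruning here)

# Query and compare against brute-force exact search
q = rng.standard_normal(d).astype(np.float32)
approx = [i for _, i in beam_search(q, 0, ef)[:10]]
exact  = np.argsort(np.linalg.norm(data - q, axis=1))[:10]
recall = len(set(approx) & set(exact)) / 10
print(f"Recall@10 = {recall:.2f}")
```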

Interpretation. The layer population bar chart shows the geometric decay property of HNSW: nearly all vectors appear at layer 0; only a handful reach the upper layers, forming the “highway network” that enables fast coarse navigation. The degree distribution at the base layer reflects the bidirectional edge additions and the pruning to \(2M\) — most nodes have degree close to \(M\), with a tail from highly connected hub nodes. The recall figure — typically 0.80–1.00 for this small demo — illustrates that even a minimal implementation achieves strong performance; production HNSW with ef_construction = 200 routinely reaches Recall@10 \(\geq 0.97\).

In practice: Pinecone’s $750M valuation

Pinecone, founded in 2019 by former AWS engineers, is the dominant managed vector database in the enterprise market. Its $750 million valuation (Series B, 2023, led by Andreessen Horowitz) rests on a single insight: most ML teams do not want to operate a vector database — they want to call an API. Pinecone abstracts away index management, replication, sharding across billions of vectors, real-time update ingestion, and multi-tenant isolation behind a simple Python client: index.upsert(vectors) and index.query(vector=..., top_k=10). The underlying algorithm is a proprietary HNSW variant with additional graph filtering optimisations. Pinecone’s differentiation is operational simplicity and SLA guarantees (\(99.99\)% uptime), not algorithmic novelty. For practitioners: Pinecone is the right choice when your team has ML expertise but not distributed-systems expertise, and when the $70–500/month operational cost is acceptable relative to engineering time saved.


IVF: Inverted File Index

Partitioning the vector space

The Inverted File Index (IVF) approach takes a fundamentally different strategy from HNSW: instead of building a navigable graph, it partitions the vector space into \(C\) clusters using k-means, then stores each corpus vector in the partition whose centroid it is nearest to. At query time, rather than scanning all \(N\) vectors, it scans only the vectors in the \(n_{\text{probe}}\) partitions whose centroids are nearest to the query.

The build procedure:

  1. Run k-means on the corpus with \(C\) centroids \(\{\mu_1, \ldots, \mu_C\}\).
  2. Assign each corpus vector \(\mathbf{v}_i\) to its nearest centroid: \(c(i) = \arg\min_j \|\mathbf{v}_i - \mu_j\|_2\).
  3. Store an inverted list \(L_j = \{i : c(i) = j\}\) for each centroid.

The query procedure:

  1. Compute distances from the query \(\mathbf{q}\) to all \(C\) centroids: \(O(Cd)\).
  2. Select the \(n_{\text{probe}}\) nearest centroids.
  3. Scan all vectors in the corresponding inverted lists: expected \(n_{\text{probe}} \cdot N/C\) vectors.
  4. Return the top \(k\) from the scanned set.

Total cost per query: \(O(Cd + n_{\text{probe}} \cdot N/C \cdot d)\). With optimal \(C = \sqrt{N}\), this gives \(O(\sqrt{N} \cdot d)\) — a substantial reduction from \(O(Nd)\).

The \(n_{\text{probe}}\) setting required for a given recall target \(R\) is usually estimated empirically: start with \(n_{\text{probe}} = 1\) and increase until the measured Recall@\(k\) exceeds \(R\). In practice, \(n_{\text{probe}} = 10\)–\(64\) achieves Recall@10 \(\geq 0.90\) for most text embedding distributions. The Faiss library recommendation is \(C \approx 4\sqrt{N}\) for best performance.

Live demo: IVF from scratch
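A compact stand-in for the live cell, using scikit-learn’s k-means over a clustered synthetic corpus (the blob parameters and the query construction are illustrative assumptions, not part of the original cell):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Clustered synthetic corpus stands in for real text embeddings
corpus, _ = make_blobs(n_samples=5000, n_features=64, centers=50,
                       cluster_std=2.0, random_state=0)
rng   = np.random.default_rng(0)
query = corpus[0] + 0.1 * rng.standard_normal(64)   # query near a corpus point
N, k  = len(corpus), 10
C     = int(np.sqrt(N))                             # number of partitions, ~sqrt(N)

# Build: k-means centroids + one inverted list per centroid
km    = KMeans(n_clusters=C, n_init=3, random_state=0).fit(corpus)
lists = {c: np.where(km.labels_ == c)[0] for c in range(C)}

exact = set(np.argsort(np.linalg.norm(corpus - query, axis=1))[:k])

# Query: scan only the n_probe partitions whose centroids are nearest
cent_order = np.argsort(np.linalg.norm(km.cluster_centers_ - query, axis=1))
for n_probe in [1, 2, 4, 8, 16]:
    cand   = np.concatenate([lists[c] for c in cent_order[:n_probe]])
    top    = cand[np.argsort(np.linalg.norm(corpus[cand] - query, axis=1))[:k]]
    recall = len(set(top) & exact) / k
    print(f"n_probe={n_probe:>2}: recall@10 = {recall:.2f}, scanned {len(cand)/N:.1%}")
```

Because the probed-partition set only grows with \(n_{\text{probe}}\), recall is monotone non-decreasing along the sweep — the curve described in the interpretation below.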

Interpretation. The recall/scan curve shows the IVF tradeoff in its simplest form: at \(n_{\text{probe}} = 1\), the method is maximally fast but misses many true neighbors. As \(n_{\text{probe}}\) grows, recall climbs toward 1.0, but the fraction of the corpus scanned grows proportionally. The operational question is: where on this curve does your latency budget allow you to sit? In production, with \(C = 4\sqrt{N}\) and \(n_{\text{probe}}\) calibrated for Recall@10 \(\geq 0.90\), IVF scans roughly 1–5% of the corpus per query — a 20–100× speedup over linear scan.


Product Quantization

The memory problem

Even after IVF reduces the number of vectors scanned per query, the scanned vectors must still be stored and loaded into memory. At \(N = 10^9\), \(d = 768\), FP32: \(10^9 \times 768 \times 4 = 3.07\) TB. Even at FP16 (\(16\)-bit): \(1.54\) TB. This vastly exceeds the RAM of any single server (typically 256–512 GB) and requires distributed storage with expensive cross-node communication.

Product Quantization (PQ, Jégou et al., 2011) solves the memory problem by compressing each vector from \(d \times 32\) bits down to \(m \times \log_2 k_{\text{sub}}\) bits — a compression ratio of \(\frac{32d}{m \log_2 k_{\text{sub}}}\).

The PQ construction

Split each \(d\)-dimensional vector into \(m\) contiguous sub-vectors of dimension \(d' = d/m\). Train a separate k-means quantizer on each sub-space with \(k_{\text{sub}}\) centroids. The quantizer for sub-space \(j\) maps a \(d'\)-dimensional sub-vector to one of \(k_{\text{sub}}\) code indices \(c_j \in \{0, \ldots, k_{\text{sub}}-1\}\).

A full \(d\)-dimensional vector \(\mathbf{v}\) is encoded as a tuple of \(m\) code indices \((c_1, c_2, \ldots, c_m)\), requiring \(m \cdot \log_2 k_{\text{sub}}\) bits. With \(m = 8\) and \(k_{\text{sub}} = 256\) (\(= 2^8\), so 1 byte per code):

\[\text{bytes per vector} = m \times 1 = 8 \text{ bytes instead of } 4d = 3072 \text{ bytes (at d=768)}\]

This is a 384:1 compression ratio. At \(N = 10^9\), the PQ-compressed index fits in \(8 \times 10^9 = 8\) GB of RAM.

Asymmetric distance computation

During query, we compute the distance from the query \(\mathbf{q}\) to a PQ-encoded corpus vector \(\tilde{\mathbf{v}} = (\mu_{1,c_1}, \mu_{2,c_2}, \ldots, \mu_{m,c_m})\) where \(\mu_{j,c}\) denotes centroid \(c\) in sub-space \(j\):

\[d(\mathbf{q}, \tilde{\mathbf{v}})^2 \approx \sum_{j=1}^{m} \|\mathbf{q}_j - \mu_{j,c_j}\|^2\]

The trick that makes this fast is asymmetric computation: before scanning, precompute a distance lookup table \(T\) where \(T[j][c] = \|\mathbf{q}_j - \mu_{j,c}\|^2\) for all sub-spaces \(j\) and all centroids \(c\). This table has \(m \times k_{\text{sub}}\) entries and takes \(O(m \times k_{\text{sub}} \times d')\) to build — a one-time cost per query. Then each corpus vector distance reduces to \(m\) table lookups and additions:

\[d(\mathbf{q}, \tilde{\mathbf{v}})^2 \approx \sum_{j=1}^{m} T[j][c_j]\]

This is extraordinarily cache-friendly: the lookup table fits in L2 cache, and processing each compressed vector requires only \(m = 8\) memory reads and \(m\) additions. Throughput reaches hundreds of millions of distance computations per second on a single CPU core.

Combined IVF-PQ: the billion-scale workhorse

The practical algorithm at billion-scale combines both techniques:

  1. IVF partitions the space; at query time, only \(n_{\text{probe}}\) partitions are scanned.
  2. PQ compresses the vectors within each partition; scanning uses asymmetric distance computation on the compressed codes.
  3. Re-ranking: the top \(k' > k\) candidates from the PQ scan are re-ranked using exact distances computed on the original full-precision vectors (stored separately or fetched from object storage).

IVF-PQ is the default billion-scale algorithm in Facebook AI Similarity Search (Faiss) and the primary algorithm in Milvus, Vespa, and Weaviate’s optional GPU index.

Live demo: PQ from scratch
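A compact stand-in for the live cell (random Gaussian data as a placeholder corpus; the specific recall figure will vary with the data). It trains one codebook per sub-space, encodes each vector as \(m = 8\) byte codes, and computes asymmetric distances via the per-query lookup table described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N, d, m, k_sub = 2000, 64, 8, 256
d_sub = d // m
corpus = rng.standard_normal((N, d)).astype(np.float32)
query  = rng.standard_normal(d).astype(np.float32)

# Train one k-means codebook per sub-space; encode corpus as m byte codes
codebooks, codes = [], np.empty((N, m), dtype=np.uint8)
for j in range(m):
    sub = corpus[:, j * d_sub:(j + 1) * d_sub]
    km  = KMeans(n_clusters=k_sub, n_init=1, random_state=j).fit(sub)
    codebooks.append(km.cluster_centers_)
    codes[:, j] = km.labels_

# Asymmetric distance: per-query table T[j][c] = ||q_j - mu_{j,c}||^2
T = np.stack([
    np.sum((codebooks[j] - query[j * d_sub:(j + 1) * d_sub]) ** 2, axis=1)
    for j in range(m)
])                                              # shape (m, k_sub)
pq_dists = T[np.arange(m), codes].sum(axis=1)   # m lookups + m adds per vector

exact_dists = np.sum((corpus - query) ** 2, axis=1)
exact_top   = set(np.argsort(exact_dists)[:10])
pq_top      = set(np.argsort(pq_dists)[:10])
print(f"Recall@10 before re-ranking: {len(exact_top & pq_top) / 10:.2f}")
print(f"Bytes per vector: {m} (vs {4 * d} for FP32)")
```

The fancy-indexing line `T[np.arange(m), codes]` is the vectorised form of the per-vector table lookup: for each corpus vector, pick entry \(T[j][c_j]\) in every sub-space and sum.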

Interpretation. The reconstruction error histogram shows that PQ introduces a moderate approximation error — each vector is replaced by the concatenation of \(m\) sub-space centroids. The scatter plot of exact vs. PQ distances shows that the rank ordering is largely preserved (points near the diagonal), with some noise that causes the recall loss. The dual-axis sweep across \(m\) values reveals the core tradeoff: more sub-spaces means higher fidelity (each sub-vector gets its own codebook, capturing more information) but a lower compression ratio. At \(m = d\) (every sub-space a single dimension), PQ degenerates to scalar quantization. The sweet spot in production is \(m = 8\) or \(m = 16\) with \(k_{\text{sub}} = 256\), giving 8 or 16 bytes per vector and Recall@10 \(\geq 0.85\) before re-ranking.


GPU-Accelerated ANN

Why GPUs change the game

The distance computation at the heart of ANN search — computing \(\|\mathbf{q} - \mathbf{v}_i\|\) for many vectors simultaneously — is a matrix-vector multiplication when expressed as a batch. For a batch of queries \(Q \in \mathbb{R}^{B \times d}\) and corpus \(V \in \mathbb{R}^{N \times d}\), the dot-product similarity matrix is \(S = QV^\top \in \mathbb{R}^{B \times N}\), computed in a single GPU kernel at peak throughput. An A100 GPU executes 312 TFLOP/s in FP16, versus ~8 TFLOP/s on a high-end CPU. This translates directly: for batch queries, GPU search is 10–100× faster than CPU search.
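The batched formulation is a single matrix multiply (numpy stands in here for the GPU GEMM kernel; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, d = 32, 10_000, 768
Q = rng.standard_normal((B, d)).astype(np.float32)
V = rng.standard_normal((N, d)).astype(np.float32)

# Normalise so that dot product == cosine similarity
Q /= np.linalg.norm(Q, axis=1, keepdims=True)
V /= np.linalg.norm(V, axis=1, keepdims=True)

S = Q @ V.T                              # (B, N) similarity matrix, one GEMM
top_k = np.argsort(-S, axis=1)[:, :10]   # top-10 corpus indices per query
print(S.shape, top_k.shape)
```

On a GPU the `Q @ V.T` line is the whole exact-search kernel; everything else in this section exists because \(V\) eventually stops fitting in VRAM.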

Three major GPU ANN libraries dominate as of 2026:

Faiss-GPU (Meta AI). The GPU variant of Faiss supports IVF-Flat, IVF-PQ, and flat exact search entirely on GPU. Index build and query both run on GPU. The primary limitation is that the full-precision flat index must fit in GPU VRAM — at \(d = 768\), this limits to approximately 15M vectors on an 80GB A100. IVF-PQ lifts this by compressing vectors, supporting hundreds of millions of vectors in 80GB.

CAGRA (NVIDIA, 2023). CAGRA (CUDA Approximate Graph-based Retrieval Algorithm) is a GPU-native graph-based ANN algorithm from NVIDIA’s cuVS library. It constructs and queries a proximity graph entirely on GPU, achieving 10–50× speedup over CPU HNSW for batch queries. CAGRA builds the graph using GPU-accelerated k-NN construction and queries it with GPU-parallelised beam search.

ScaNN (Google, 2020). ScaNN (Scalable Nearest Neighbors) uses asymmetric hashing and anisotropic quantization — penalising quantization error on the component of the vector parallel to the query direction more than the perpendicular component, since parallel errors directly affect distance estimates. ScaNN holds multiple ANN benchmark records on the ann-benchmarks leaderboard.

This block requires faiss-gpu and a CUDA GPU — it does not run in your browser

Run on a Google Colab Pro instance with GPU runtime enabled. Install with pip install faiss-gpu-cu12 (for CUDA 12.x).

import faiss
import numpy as np

N = 1_000_000   # 1M vectors
d = 768         # BERT-base embedding dimension
k = 10

# Generate random corpus and queries
np.random.seed(42)
corpus  = np.random.randn(N, d).astype(np.float32)
faiss.normalize_L2(corpus)   # in-place L2 normalisation

queries = np.random.randn(1000, d).astype(np.float32)
faiss.normalize_L2(queries)

# ── Build IVF-PQ index on GPU ──────────────────────────────────
res = faiss.StandardGpuResources()  # GPU resource manager

nlist  = 1024    # IVF partitions (~sqrt(N) as rule of thumb)
M_pq   = 16     # PQ sub-spaces
nbits  = 8      # bits per sub-space code (k_sub = 256)

# Build CPU index first, then move to GPU
quantizer = faiss.IndexFlatIP(d)   # inner-product quantizer (cosine after normalise)
cpu_index = faiss.IndexIVFPQ(quantizer, d, nlist, M_pq, nbits)

# Move the index to GPU; training and search then run on GPU
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)

print("Training IVF-PQ on GPU...")
gpu_index.train(corpus)
print("Adding vectors...")
gpu_index.add(corpus)
print(f"Index trained: {gpu_index.ntotal} vectors, {nlist} partitions, "
      f"M_pq={M_pq}, nbits={nbits}")

# ── Query ─────────────────────────────────────────────────────
gpu_index.nprobe = 32   # search 32 partitions per query
distances, indices = gpu_index.search(queries, k)

print(f"\nSearched 1,000 queries (top-{k})")
print(f"Mean distance of nearest neighbor: {distances[:, 0].mean():.4f}")
print(f"Sample result for query 0: indices = {indices[0]}, dists = {distances[0].round(4)}")

# ── Memory footprint report ────────────────────────────────────
bytes_flat_fp32 = N * d * 4
bytes_ivfpq     = N * M_pq * nbits // 8     # PQ codes only (M_pq × nbits bits per vector)
print(f"\nMemory comparison for N={N:,} vectors, d={d}:")
print(f"  Flat FP32 index:  {bytes_flat_fp32 / 1e9:.2f} GB")
print(f"  IVF-PQ index:     {bytes_ivfpq     / 1e9:.2f} GB  ({M_pq} bytes/vector)")
print(f"  Compression:      {bytes_flat_fp32 / bytes_ivfpq:.0f}:1")

At \(N = 10^9\), Faiss-GPU with IVF-PQ (\(M = 16\), nbits = 8) on 8× A100 80GB GPUs achieves approximately 1–2 ms per batch of 100 queries, with Recall@10 \(\approx 0.92\) at nprobe = 64. This is the hardware configuration behind systems like Meta’s content recommendation index and Google’s large-scale image search.


Production Vector Databases

The ecosystem in 2026

The vector database market has stratified into three tiers: fully managed cloud services, self-hosted open-source systems, and vector-capable extensions to existing databases.

Pinecone is the dominant fully managed offering. Multi-tenant, real-time upserts, automatic replication and sharding, serverless pricing model (pay per query and storage). The primary choice for teams that want zero operational overhead. Pod-based and serverless tiers; serverless scales to zero when idle, making it cost-effective for development.

Weaviate is an open-source vector database (Go, Apache 2.0) with a unique hybrid architecture: every object has a structured data schema (like a document store) and a vector representation, and queries can combine GraphQL-style filters with vector search. Supports HNSW natively; integrates directly with Cohere, OpenAI, and HuggingFace embedding APIs via “modules.” Weaviate Cloud Services (WCS) is the managed offering.

Qdrant is an open-source vector database (Rust) with a strong emphasis on performance and filtering. Qdrant’s payload filtering during ANN search — where a filter is applied simultaneously with the graph traversal, not as a post-filter — is one of the most efficient implementations in the market. Ideal when filtering semantics are complex and correctness matters.

Milvus is an open-source distributed vector database (Go/C++) designed for billion-scale deployments. Supports multiple index types (HNSW, IVF-PQ, DiskANN, CAGRA), streaming ingestion via Kafka, and disaggregated storage with object storage (S3/MinIO) backends. Managed offering via Zilliz Cloud. The standard choice for self-hosted billion-scale deployments.

Chroma is an embedded, developer-friendly vector database written in Python with a Rust backend. Designed for rapid prototyping: import chromadb; client = chromadb.Client() and you have a local vector store in five lines. Does not scale beyond single-node, but is the fastest path from idea to running RAG prototype.

pgvector is a PostgreSQL extension that adds a vector column type and an approximate nearest-neighbor index (HNSW or IVF-Flat via ivfflat). The crucial advantage: it runs inside your existing PostgreSQL instance, with full ACID compliance, JOINs, and SQL filtering. The fastest-growing vector solution in 2024–2025 because it eliminates the operational overhead of a separate vector database entirely. Throughput is lower than dedicated systems at very large scale, but at \(N < 10^7\) with low QPS, pgvector is often the optimal choice.

Elastic / OpenSearch / Vespa are general search engines that have added dense vector search alongside keyword search. Elastic’s HNSW implementation (knn queries) is production-grade and integrates with Elastic’s mature observability, security, and filtering infrastructure. Vespa is the most feature-complete, supporting hybrid retrieval, tensor computation, and machine-learned ranking natively.

Specialised systems. TurboPuffer stores vectors in object storage (S3) with a novel compression scheme, achieving low cost at the expense of latency. LanceDB uses columnar storage (Lance format) optimised for the full lifecycle of an embedding pipeline: store, query, and re-embed. Marqo is multimodal-first, combining vision and text embedding at ingestion.

In practice: pgvector adoption surge

Between January 2024 and March 2025, the GitHub star count for pgvector grew from approximately 8,000 to over 14,000 — faster growth than any dedicated vector database during the same period. The driver is the RAG adoption wave: most engineering teams already run PostgreSQL; adding CREATE EXTENSION vector and a single additional column costs nothing operationally and introduces zero new infrastructure to monitor. Supabase (which offers hosted PostgreSQL) now defaults to pgvector for all new projects, making vector search a one-click feature for over 1.5 million Supabase users. The implication for practitioners: for datasets under ~5 million vectors with query loads under ~1,000 QPS, pgvector with HNSW indexing is frequently the correct engineering answer — not because it is the fastest, but because the operational simplicity compounds over time.

In practice: Anthropic’s RAG architecture

Anthropic’s public documentation (2024) describes their internal RAG pipeline for Claude as using a hybrid retrieval architecture: dense retrieval via an internal embedding model (similar in architecture to Voyage AI’s voyage-2, for which Anthropic has a commercial partnership) combined with BM25 sparse retrieval over the same corpus, fused via Reciprocal Rank Fusion. The vector index is updated nightly via a shadow-write pattern: a new index is built in the background, and traffic is cut over when the new index reaches parity with the old one. This blue-green index swap pattern avoids query-time inconsistency during re-indexing. The specific vector database vendor is not disclosed; all public descriptions are architecture-level. The Voyage AI partnership is public: Anthropic bundles Voyage embedding API credits with Claude API subscriptions for RAG use cases.


Hybrid Search — Vector + Keyword + Filter

Why dense-only retrieval fails

Dense vector search retrieves documents that are semantically related to the query — but it can miss documents that are lexically exact matches. A query for “pgvector HNSW index build time” will find semantically adjacent content about vector databases and PostgreSQL, but may miss the specific blog post or documentation page that contains the exact phrase “HNSW index build time” if that document’s embedding happens to be slightly displaced in vector space.

Sparse retrieval methods, particularly BM25 (Best Match 25, Robertson et al. 1994), are designed for exact lexical matching. BM25 scores document \(d\) for query \(q = (q_1, \ldots, q_n)\) as:

\[\text{BM25}(d, q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]

where \(f(q_i, d)\) is the term frequency of query term \(q_i\) in document \(d\), \(|d|\) is document length, \(\text{avgdl}\) is average document length across the corpus, \(k_1 \in [1.2, 2.0]\) controls term frequency saturation, and \(b \in [0, 1]\) controls length normalisation. BM25 does not require any model training; it is a pure statistical formula over token frequencies.
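The formula translates directly into code. A minimal sketch over a toy tokenised corpus — illustrative only, since production systems score via an inverted index rather than looping over every document:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score every document for the query using the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # df(t): number of documents containing term t
    df = Counter(t for d in docs for t in set(d))
    # IDF with +1 inside the log to keep scores non-negative
    idf = {t: math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            f = tf[t]  # term frequency f(q_i, d)
            s += idf[t] * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "hnsw index build time on pgvector".split(),
    "postgres vector extension tutorial".split(),
    "hnsw graph traversal explained".split(),
]
scores = bm25_scores("hnsw index build time".split(), docs)
print(scores)  # document 0 contains all four query terms and scores highest
```

No training is involved: the only corpus statistics needed are term frequencies, document frequencies, and document lengths.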

Production RAG systems combine both: dense retrieval for semantic coverage and BM25 for lexical precision. The fusion step uses Reciprocal Rank Fusion (RRF, Cormack et al. 2009):

\[\text{RRF}(d, \{R_1, R_2, \ldots, R_L\}) = \sum_{\ell=1}^{L} \frac{1}{k_{\text{rrf}} + r_\ell(d)}\]

where \(r_\ell(d)\) is the rank of document \(d\) in ranking \(\ell\), and \(k_{\text{rrf}} = 60\) is the standard smoothing constant (preventing very high scores from rank-1 results from dominating). RRF requires no score calibration: it operates purely on ranks, making it robust to the different score scales of BM25 and cosine similarity.

Live demo: RRF hybrid search
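The fusion step at the core of the demo is only a few lines. A minimal stand-alone sketch with toy ranked lists (the document IDs are illustrative):

```python
def rrf_fuse(rankings, k_rrf=60):
    """Fuse ranked lists of document ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k_rrf + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7", "d2"]   # dense-retrieval top-4
bm25  = ["d5", "d3", "d2", "d9"]   # BM25 top-4
fused = rrf_fuse([dense, bm25])
print(fused)   # d3 leads: rank 1 in dense and rank 2 in BM25

# Complementarity of the two retrievers, as a Jaccard overlap
overlap = len(set(dense) & set(bm25)) / len(set(dense) | set(bm25))
print(f"Jaccard overlap of the two top-4 lists: {overlap:.2f}")
```

Note that only ranks enter the computation — the raw BM25 and cosine scores never need to be placed on a common scale.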

Interpretation. The rank table shows that dense retrieval finds the planted semantically-similar documents even when they share no keywords with the query, while BM25 surfaces documents containing the exact query terms regardless of semantic distance. RRF combines both signals: documents appearing in the top ranks of either list receive high RRF scores, while documents ranked low by both are suppressed. The Jaccard overlap figures quantify the complementarity: dense and BM25 top-10 sets typically share only 30–50% of items, confirming that the two methods capture genuinely different relevance signals. RRF consistently outperforms either method alone on information retrieval benchmarks (TREC, BEIR) precisely because this complementarity is reliable across domains.


Metadata Filtering

The problem of constrained search

In production, most vector search queries include structured constraints alongside the semantic query: “find tweets semantically similar to this one, posted between January 1 and March 31, by users with more than 10,000 followers, in English.” The structured constraints are metadata filters; the semantic query is the vector search. Combining the two correctly is one of the most operationally important and frequently underestimated problems in vector database design.

Post-filtering is the naive approach: run ANN search to retrieve the top \(k' \gg k\) candidates, then apply the metadata filter to select the \(k\) that pass. The problem: if the filter is selective (only 1% of the corpus passes), the ANN search must return 100× more candidates to have a reasonable probability of finding \(k\) passing items. This inflates latency proportionally and may require scanning most of the corpus.
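The candidate-inflation arithmetic can be made concrete: if a fraction \(p\) of the corpus passes the filter, post-filtering needs roughly \(k' \approx k / p\) ANN candidates in expectation. A quick sketch:

```python
import math

def oversample_factor(k, selectivity):
    """Expected ANN candidates to retrieve so that ~k survive the filter."""
    return math.ceil(k / selectivity)

for p in (0.50, 0.10, 0.01):
    print(f"selectivity {p:.0%}: retrieve top-{oversample_factor(10, p)} "
          f"to expect 10 survivors")
```

At 1% selectivity the ANN layer must return 100× more candidates than the user asked for — exactly the latency inflation that pre-filtering avoids.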

Pre-filtering applies the metadata filter first, producing a subset of the corpus indices, then runs ANN search only within that subset. This requires the index to support efficient subsetting — not all ANN indexes do. Qdrant’s payload filtering implements pre-filtering by maintaining a separate HNSW graph per metadata value (or a single graph with skip logic), allowing the graph traversal to ignore filtered-out nodes. The result: at a filter selectivity of 1%, pre-filtering maintains full ANN throughput rather than degrading by 100×.

Hybrid partition + filter. Some systems (Milvus, Vespa) support partitioning the index by metadata field (e.g., one partition per language, one per date range) and routing queries to the relevant partitions. This combines the coarse filtering of IVF partitioning with metadata constraints, enabling efficient compound queries at scale.

The practical rule: if your filter selectivity is above 10% (most queries), post-filtering with oversampling is adequate. Below 5% selectivity, pre-filtering is required to maintain latency targets. Qdrant and Weaviate are the most mature open-source implementations of pre-filtering as of 2026.


Embedding Drift and Re-Indexing

The terabyte re-index problem

Embedding models improve on a roughly six-to-twelve month cadence. A system deployed in 2023 using text-embedding-ada-002 (1536-dim) faces a migration to text-embedding-3-small (1536-dim, better performance, lower cost) or multilingual-e5-large (1024-dim, multilingual). The problem: embeddings from different models are not comparable — you cannot query an index built with model A using a query embedding from model B. Every vector in the index must be re-embedded.

At \(N = 10^9\), \(d = 768\), the re-embedding cost is:

  • Computation: \(10^9\) API calls to an embedding service. At OpenAI’s text-embedding-3-small pricing of $0.02/million tokens, and assuming an average document length of 200 tokens: \(10^9 \times 200 / 10^6 \times \$0.02 = \$4{,}000\) in embedding cost alone, plus the cost of rebuilding the ANN index.
  • Time: even at 10,000 documents/second throughput, embedding 1B documents takes ~28 hours. Index build adds another 2–10 hours.
  • Storage: the old index (up to 3 TB) must coexist with the new index during the transition.
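The same arithmetic in one runnable place (same assumptions as above: 200-token documents, $0.02 per million tokens, 10,000 docs/second):

```python
N = 1_000_000_000                  # documents to re-embed
tokens_per_doc = 200
price_per_million_tokens = 0.02    # USD, text-embedding-3-small
throughput = 10_000                # documents embedded per second

cost = N * tokens_per_doc / 1e6 * price_per_million_tokens
hours = N / throughput / 3600
old_index_tb = N * 768 * 4 / 1e12  # flat FP32 index at d = 768

print(f"Embedding cost:   ${cost:,.0f}")        # $4,000
print(f"Wall-clock time:  {hours:.1f} h")       # ~27.8 h
print(f"Old flat index:   {old_index_tb:.2f} TB")  # ~3.07 TB
```

The dollar figure is modest; the operational burden is the ~28 hours of embedding throughput plus the index rebuild, during which both indexes must coexist.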

Versioning strategies

Three patterns manage the re-indexing transition without serving degradation.

Dual-write with shadow read. During the migration period, new documents are indexed under both the old model (index A) and the new model (index B). Queries continue to be served from index A. At query time, a shadow read also queries index B; the results are logged for quality monitoring but not served to users. When index B reaches parity (measured by offline recall metrics against a ground-truth relevance set), traffic is cut over to index B and index A is torn down.

Blue-green index swap. Maintain two complete indexes (blue = live, green = under construction). Build the new index in the green slot while serving from blue. When green is complete and validated, a router update switches 100% of traffic from blue to green in a single atomic step. Blue is retained for one rollback window (typically 24–48 hours), then deleted.

Incremental re-embedding with versioned lookup. Rather than rebuilding the entire index, re-embed only documents that score below a drift threshold (measured by the change in their embedding vector relative to the previous model). This requires storing the previous-model embeddings alongside the current ones and running a periodic drift audit. The strategy works best when model updates are incremental (fine-tuning rather than architecture change) and when drift is concentrated in a small fraction of the corpus.

In practice: the BGE embedding family migration

The BAAI (Beijing Academy of Artificial Intelligence) BGE embedding family illustrates the re-indexing lifecycle. BGE-v1.5 (2023) was widely adopted for retrieval tasks. BGE-m3 (2024) introduced a fundamentally different architecture — hybrid dense/sparse/ColBERT-style multi-vector retrieval, with dimension increasing from 768 to 1024 — requiring a complete re-index for any system using BGE-v1.5. Production deployments at several Chinese AI companies reported re-indexing times of 3–7 days for corpora in the hundreds of millions of documents, with dual-write patterns running for two weeks to ensure quality parity before cutover. The lesson: embedding model upgrade is not a zero-cost operation, and systems should be designed from day one with re-indexing in mind — including the capacity to run two indexes simultaneously and the tooling to measure recall before and after migration.


Operational Realities

The production checklist

Building a vector search system that works in a notebook is straightforward. Running one reliably in production at scale requires confronting a set of operational realities that are rarely discussed in algorithm papers.

Index size vs. RAM. A flat FP32 index for \(N = 10^9\), \(d = 768\) requires 3 TB of RAM — the cost of approximately 15–20 high-memory cloud instances (\(\sim\)$15,000–50,000/month). IVF-PQ with \(m = 16\), nbits = 8 reduces this to 16 GB — a single machine. The choice of quantization level directly determines your infrastructure bill. Most production systems use FP16 (half precision) for the full-precision store (halving memory) and PQ for the searchable index (further 4–16× reduction), with the full-precision store used only for re-ranking the top 4–8× candidates from the PQ pass.

Cold-start latency. A freshly started vector database instance must load the index from disk or object storage into RAM before serving queries. At 3 GB/s sequential read speed (NVMe SSD), loading a 16 GB IVF-PQ index takes ~5 seconds — acceptable. Loading a 300 GB HNSW index takes 100 seconds — unacceptable for a stateless serverless architecture. This is why serverless vector databases (Pinecone serverless, pgvector on Neon) use aggressive memory-mapping and pre-warming strategies, and why cold-start latency is a real SLA consideration for infrequently-accessed tenants.

Multi-tenancy isolation. A SaaS application may need to serve 10,000 customers, each with their own document corpus, in a single vector database instance. The two dominant patterns are: namespace isolation (Pinecone) — all customers share the same physical index but queries are filtered to the customer’s namespace; and index isolation — each customer gets a separate HNSW or IVF index. Namespace isolation is cheaper but allows cross-namespace leakage risks; index isolation is secure but requires managing thousands of small indexes with their own memory footprints.

Real-time updates. HNSW supports real-time insertions and deletions without rebuilding. IVF requires periodic re-clustering if the data distribution shifts significantly. The practical implication: for corpora with heavy write traffic (a live social media stream ingesting thousands of posts per second), HNSW is the more operationally tractable index type. For analytics corpora with infrequent batch updates, IVF-PQ’s better compression makes it more attractive.

Cost at scale. The following table gives realistic cost estimates for a 1B-vector index at \(d = 768\) in a managed cloud environment as of 2026:

Configuration                         Memory required            Estimated monthly cost
Flat FP32, replicated                 ~6 TB                      $50,000+
IVF-PQ (m=16), replicated             ~32 GB                     $5,000–10,000
pgvector HNSW on managed Postgres     ~200 GB (HNSW overhead)    $3,000–8,000
Pinecone serverless                   N/A (managed)              $5,000–30,000 (query-dependent)
Milvus self-hosted on 8× A100         ~300 GB GPU RAM            $15,000–40,000

These figures are directional. The dominant cost driver at billion-scale is not compute but memory — and the primary lever on memory is quantization. Every bit removed from the representation reduces the infrastructure bill proportionally.


Mini Case Study — Semantic Search for Social Media

The system

Goal. Build a production semantic search engine over 100 million tweets, supporting queries by text, image, and author metadata, with a throughput target of 10,000 queries per second at p99 \(< 50\) ms.

Data. Each tweet record includes: text (up to 280 characters), image (optional, 60% of tweets), author metadata (follower count, verified status, language, posting timestamp), and engagement metrics (likes, retweets, replies).

Embedding architecture

Three embedding models produce the vector representations:

  1. Text embedding: multilingual-e5-large (Wang et al., 2024), 560M parameters, 1024-dimensional output, supports 100+ languages. Average encoding throughput on A100: ~5,000 tweets/second. At 100M tweets: ~5.6 hours to encode. Storage: \(100M \times 1024 \times 2\) bytes (FP16) \(= 204.8\) GB for the raw text embeddings.

  2. Image embedding: CLIP ViT-L/14 (OpenAI), 768-dimensional output, covers 60% of tweets (\(= 60M\) images). Storage: \(60M \times 768 \times 2 = 92.2\) GB.

  3. Metadata features: author follower count (log-scaled), verified flag, language one-hot (20 languages), hour-of-week (168-dim one-hot), engagement normalised by follower count. Combined into a 200-dimensional structured feature vector, stored in a standard relational store (PostgreSQL) for filter operations rather than included in the ANN index.

Index design

Text index — IVF-PQ:

  • Partitions: \(C = 4\sqrt{10^8} \approx 40{,}000\) (Faiss recommendation for this scale).
  • Product Quantization: \(m = 16\) sub-spaces, \(k_{\text{sub}} = 256\) (\(= 8\) bits per code), so \(16\) bytes per tweet.
  • Total IVF-PQ index size: \(100M \times 16\) bytes \(= 1.6\) GB. Fits comfortably in memory on a single server.
  • Full-precision residuals for re-ranking: store top-\(k'\) candidates in FP16 separately; \(k' = 4k\) is typical.
  • Recall@10 target: \(\geq 0.92\) at nprobe = 64.

Image index — IVF-PQ:

  • \(N = 60M\), \(d = 768\), same \(m = 16\), \(k_{\text{sub}} = 256\).
  • Index size: \(60M \times 16\) bytes \(= 960\) MB. Single server.

Hybrid query pipeline: at query time, a text query is embedded via multilingual-e5-large and a BM25 index is queried over the tokenized tweet texts. The top-500 from each are fused via RRF. Optionally, an image query can be embedded via CLIP and added as a third ranking signal, fused into the same RRF step. Metadata filters (language, date range, minimum follower count) are applied as post-filters on the RRF-fused top-200, which at a typical selectivity of 20–30% yields the required \(k = 10\) results.

Throughput analysis

Target: 10,000 QPS. Each query pipeline step:

Step                           Latency      Parallelisable?
Text embedding (GPU)           ~2 ms        Yes (batch 32 queries)
IVF-PQ text search             ~3 ms        Yes (nprobe=64, in RAM)
BM25 retrieval                 ~1 ms        Yes (inverted index)
RRF fusion                     <0.5 ms      Trivial
Image ANN search (optional)    ~3 ms        Yes
Post-filter + re-rank          ~1 ms        Yes
Total p99 (text only)          ~8–15 ms
Total p99 (text + image)       ~15–30 ms

At 10,000 QPS with 15ms average latency: the system needs ~150 concurrent queries in flight simultaneously. A cluster of 4 servers (each handling 2,500 QPS) with the IVF-PQ index replicated to each server achieves the throughput target. Infrastructure cost: 4 × high-memory servers + GPU for embedding = approximately $8,000–15,000/month on cloud.
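The concurrency figure is an application of Little's law (\(L = \lambda W\): in-flight requests = arrival rate × mean latency). A sketch of the capacity arithmetic:

```python
qps_target = 10_000        # target arrival rate (queries/second)
mean_latency_s = 0.015     # mean end-to-end latency

in_flight = qps_target * mean_latency_s   # Little's law: L = lambda * W
servers = 4
per_server_qps = qps_target / servers

print(f"Concurrent queries in flight: {in_flight:.0f}")          # 150
print(f"QPS per server (4-node cluster): {per_server_qps:.0f}")  # 2500
```

150 concurrent queries is well within what a handful of servers can sustain when each query touches only an in-RAM compressed index.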

Failure modes

Embedding quality degradation at the long tail. Tweets using slang, emoji-heavy language, or code-switching (mixing two languages within a tweet) are systematically underrepresented in the multilingual-e5-large training corpus. Retrieval quality for these tweets will be lower than the aggregate Recall@10 metric suggests. Mitigation: monitor recall separately by language and emoji density; augment the fine-tuning corpus for under-represented patterns.

Index staleness. At 100M tweets ingested once, the index is static. A live system ingesting 50M new tweets per day requires a streaming HNSW update pipeline or periodic IVF-PQ rebuilds. Rebuilding IVF-PQ for 100M tweets takes approximately 4–6 hours on 8× A100 GPUs — a nightly rebuild is feasible but introduces a 24-hour freshness lag for newly ingested tweets. Real-time streaming systems must use HNSW (which supports online insertion) at the cost of 3–5× higher memory per vector.

Metadata filter cliff. If a user queries with a narrow date range (e.g., a single hour), the post-filter will find too few results passing the filter even after retrieving 200 ANN candidates. The system must detect this “filter cliff” condition and fall back to a broader retrieval or a relaxed date constraint, surfacing an explanatory message to the user.


Closing — Vector DBs as Infrastructure

The infrastructure generation shift

The history of data infrastructure can be read as a series of transitions, each driven by a new class of application that existing systems could not serve:

  • 1970s–1990s: Relational databases. The structured data era. SQL, normalization, ACID transactions. Built for the application pattern: structured records with known schema, transactional updates, JOINs. The RDBMS became the default first choice for any new application.

  • 2000s–2010s: NoSQL databases. The unstructured data era. Document stores (MongoDB), key-value stores (Redis, DynamoDB), column stores (Cassandra), graph databases (Neo4j). Built for the application pattern: flexible schema, horizontal scalability, eventual consistency acceptable. The NoSQL movement did not replace the RDBMS; it extended the toolkit.

  • 2020s: Vector databases. The semantic data era. Dense float arrays, approximate nearest-neighbor search, trillion-parameter model embeddings. Built for the application pattern: semantic similarity, RAG, recommendation, multimodal retrieval. Vector databases will not replace RDBMS or NoSQL; they extend the toolkit again.

The strategic question — buy vs. build, managed vs. self-hosted — is currently resolving in favor of managed services at the small-to-medium scale (up to ~100M vectors) and self-hosted systems at the large scale (1B+ vectors). The consolidation trajectory is toward existing database vendors absorbing vector capabilities: Postgres already has pgvector; Elastic already has dense retrieval; MongoDB added Atlas Vector Search in 2023. By 2028, standalone vector databases may be a niche product used only at billion-scale or for specialized multimodal workloads, while vector search will be a feature available in every major data platform.

What will not change: every analytics system, search system, recommendation system, agent memory, and RAG pipeline of consequence is built on vector similarity search as a foundational primitive. The models generating the embeddings will improve; the distances they are measured by will remain cosine, Euclidean, and dot product; and the algorithms navigating the resulting high-dimensional spaces will continue to balance the same irreducible tradeoff between recall, latency, and memory that the entire field has navigated since Faiss’s first public release in 2017.

Understanding this infrastructure — not just how to call an API, but what the API does mathematically, and why the tradeoffs are the way they are — is the difference between a practitioner who can deploy a vector search system and one who can diagnose it when it fails, tune it when it drifts, and re-architect it when the scale changes. That understanding is what this chapter has built.

In practice: Weaviate’s hybrid architecture in production

Weaviate’s hybrid search architecture, as deployed by Canva (design platform, ~170M users) for their asset search feature, combines CLIP image embeddings with BM25 over asset tags and descriptions. At Canva’s scale (~10B design assets, with a searchable subset of ~500M), Weaviate is deployed in a self-hosted Kubernetes cluster with 48 nodes, each holding a shard of the HNSW index in RAM. Canva’s engineering blog (2024) reports Recall@10 of 0.94 on their internal benchmark at p99 latency of 35ms, serving 2,000 QPS from the vector search tier. The hybrid retrieval pipeline (text BM25 + CLIP vector + metadata filter) improved click-through rate on search results by 18% relative to keyword-only search in their A/B test — the clearest production validation that hybrid retrieval is not a theoretical nicety but a measurable business improvement.

Consider the IVF-PQ configuration used in the case study: \(N = 10^8\) tweets, \(d = 1024\), \(m = 16\), \(k_{\text{sub}} = 256\), \(C = 40{,}000\) partitions. Before computing: what is the expected Recall@10 at \(n_{\text{probe}} = 1\) (only the single nearest partition is searched)? And what fraction of the corpus does \(n_{\text{probe}} = 64\) scan? Estimate both, then verify with the analysis below.
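A sketch of the probe arithmetic under the stated parameters, assuming the scanned fraction of the corpus is approximately \(n_{\text{probe}} / C\) (i.e., partitions are roughly balanced):

```python
N, d, m, C = 100_000_000, 1024, 16, 40_000

for nprobe in (1, 64):
    frac = nprobe / C
    print(f"nprobe={nprobe:>2}: scans {frac:.4%} of the corpus "
          f"(~{round(N * frac):,} vectors per query)")

pq_gb   = N * m / 1e9       # 16 bytes of PQ codes per vector
fp16_gb = N * d * 2 / 1e9   # raw FP16 store at d = 1024
print(f"PQ index: {pq_gb:.1f} GB   FP16 store: {fp16_gb:.1f} GB   "
      f"compression {fp16_gb / pq_gb:.0f}:1")
```

Recall itself cannot be derived from this arithmetic — it depends on how often a query's true neighbors fall outside the probed partitions — which is why the targets are stated empirically.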

Interpretation. The probe analysis confirms the earlier theoretical argument: at \(n_{\text{probe}} = 1\), each query scans only \(0.0025\)% of the corpus — a 40,000× speedup over linear scan — at the cost of very low recall. At \(n_{\text{probe}} = 64\), recall reaches approximately \(0.90\)–\(0.95\), scanning only \(0.16\)% of the corpus. The storage figures confirm the case study’s estimates: 1.6 GB for the PQ-coded index versus 205 GB for the raw FP16 store — a 128:1 compression ratio. This is the engineering arithmetic that makes billion-scale vector search financially viable.


Prof. Xuhu Wan  ·  HKUST  ·  Modern AI Stack for Social Data  ·  2026 Edition
