Chapter 9: Vector Databases and Embedding Infrastructure
About This Chapter
Every chapter until now has treated the embedding as an endpoint: you pass text through a transformer encoder, collect the dense vector that emerges, and use it as a feature for some downstream classifier or generative model. That framing is correct as far as it goes. But it omits the most operationally consequential question: once you have a billion embeddings, how do you find the one that is most similar to a new query in under fifty milliseconds?
The answer is a vector database — a storage and retrieval system purpose-built for high-dimensional floating-point vectors. The category barely existed in 2019. By 2025, Pinecone had raised $138 million at a $750 million valuation, Weaviate had raised $68 million at approximately $200 million, and Chroma had raised $20 million at seed. More telling than the venture numbers: pgvector, a free PostgreSQL extension that adds a vector similarity operator to ordinary SQL, became the fastest-growing PostgreSQL extension of 2024–2025, with adoption accelerating alongside every new RAG (Retrieval-Augmented Generation) deployment. The reason for all of this investment and adoption is simple: embeddings are useless without a way to find the nearest neighbor at scale.
This chapter builds the full picture from the ground up. We begin with the mathematical statement of the nearest-neighbor problem and the performance gap that the entire industry exists to close. We derive the three standard distance metrics and show — with live demonstrations — how the choice of metric changes which items are retrieved. We prove the curse of dimensionality and show empirically how it breaks classical indexing. We then develop, in full mathematical and algorithmic detail, the two dominant families of approximate nearest-neighbor (ANN) algorithms: HNSW (Hierarchical Navigable Small World), which is the graph-based method behind Faiss, Qdrant, and Weaviate’s default index; and IVF-PQ (Inverted File Index with Product Quantization), which is the billion-scale workhorse used in Faiss-GPU and Milvus. We survey the production vector database landscape, cover hybrid search and metadata filtering, and close with a fully worked architectural case study: building a semantic search system for 100 million tweets.
The trajectory of the industry is clear. In the 2000s, the default infrastructure question for a new application was “which relational database?” In the 2010s, it was “which NoSQL store?” In the 2020s, it is “which vector database?” — and by the end of this chapter, you will be equipped to answer that question with engineering precision rather than vendor marketing.
This chapter builds on the embedding foundations introduced in Chapters 3–5 of this book (transformer encoders, sentence embeddings, and multimodal encoders) and assumes familiarity with cosine similarity and the pretraining/fine-tuning paradigm. Key external references are: Malkov & Yashunin (2018), “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs” (IEEE TPAMI); Jégou et al. (2011), “Product quantization for nearest neighbor search” (IEEE TPAMI); Beyer et al. (1999), “When is ‘nearest neighbor’ meaningful?”; and Johnson et al. (2021), “Billion-scale similarity search with GPUs” (the Faiss paper, IEEE Trans. Big Data). Non-executing Python blocks, which require faiss-gpu, pinecone-client, or psycopg2 with the pgvector extension, are clearly marked as such. All live cells run in the browser using only numpy, scipy, scikit-learn, pandas, and matplotlib.
Table of Contents
- The Nearest-Neighbor Problem
- Distance Metrics
- The Curse of Dimensionality
- Approximate Nearest Neighbor — the Tradeoff
- HNSW: Hierarchical Navigable Small World
- IVF: Inverted File Index
- Product Quantization
- GPU-Accelerated ANN
- Production Vector Databases
- Hybrid Search — Vector + Keyword + Filter
- Metadata Filtering
- Embedding Drift and Re-Indexing
- Operational Realities
- Mini Case Study — Semantic Search for Social Media
- Closing — Vector DBs as Infrastructure
The Nearest-Neighbor Problem
Statement of the problem
The nearest-neighbor problem is stated simply. Given a corpus of \(N\) vectors \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_N \in \mathbb{R}^d\) stored in an index, and a query vector \(\mathbf{q} \in \mathbb{R}^d\), find the \(k\) vectors in the corpus that are closest to \(\mathbf{q}\) under a specified distance function \(\text{dist}(\cdot, \cdot)\). The result is the set of \(k\) indices with the smallest distances, conventionally returned in ascending order of distance:
\[\text{NN}_k(\mathbf{q}) = \underset{S \subseteq [N],\; |S|=k}{\arg\min} \sum_{i \in S} \text{dist}(\mathbf{q}, \mathbf{v}_i)\]
This is well-defined for any metric. The difficulty is not the definition; it is the scale.
The performance gap
A naive implementation scans every vector in the corpus, computes the distance to the query, and returns the top \(k\) by sorting. This exact linear scan requires \(N\) distance evaluations, each costing \(O(d)\) floating-point operations. Total cost per query: \(O(Nd)\).
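As a concrete baseline, here is a minimal numpy sketch of the exact linear scan (the corpus size is a toy stand-in for a real index): compute every distance, then take the top \(k\).

```python
import numpy as np

def exact_knn(query, corpus, k=10):
    """Exact k-NN by brute-force linear scan: O(N*d) work per query."""
    dists = np.linalg.norm(corpus - query, axis=1)   # N distance evaluations
    top_k = np.argpartition(dists, k)[:k]            # k smallest, unordered, O(N)
    order = np.argsort(dists[top_k])                 # sort only the k survivors
    return top_k[order], dists[top_k[order]]

rng = np.random.default_rng(0)
corpus = rng.standard_normal((50_000, 768)).astype(np.float32)  # toy stand-in for a real index
query = rng.standard_normal(768).astype(np.float32)
idx, d = exact_knn(query, corpus)
print(idx, d.round(3))
```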
At the scale of a production social media analytics system:
| Parameter | Value |
|---|---|
| Corpus size \(N\) | \(10^9\) (one billion embeddings) |
| Embedding dimension \(d\) | 768 (BERT-base, multilingual-e5-base) |
| FLOPs per distance evaluation | 768 multiply-adds = 1,536 FLOPs |
| Total FLOPs per query | \(1.536 \times 10^{12}\) (≈1.5 TFLOPs) |
| A100 GPU throughput | ~312 TFLOP/s (FP16) |
| Minimum latency (theory) | ~5 ms |
| Realistic latency (memory bandwidth) | 500 ms – 5 s |
The bottleneck is not arithmetic throughput; it is memory bandwidth. A 1B-vector index at \(d = 768\) in FP32 requires 3 TB of raw data — far exceeding GPU memory. Streaming 3 TB from disk or DRAM to compute one query incurs tens of seconds of latency, making exact linear scan incompatible with any interactive or real-time application.
The entire field of approximate nearest-neighbor (ANN) search exists to eliminate this gap. ANN algorithms achieve query latencies of 1–10 milliseconds on billion-scale indexes by sacrificing a small amount of recall accuracy — typically 5–15% of the true nearest neighbors are missed — in exchange for orders-of-magnitude faster retrieval.
Distance Metrics
The three primary metrics
Three distance functions dominate practical vector search. Each encodes a different geometric assumption about what “close” means.
Euclidean distance (\(L_2\)) measures the straight-line distance between two points in \(\mathbb{R}^d\):
\[d_{\text{Euclidean}}(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{d} (u_i - v_i)^2}\]
Euclidean distance is magnitude-sensitive: a vector with large norm will be “far” from all others regardless of its direction. It is the natural choice when the absolute scale of the embedding carries semantic information — for example, in certain image embedding spaces where brightness and contrast are encoded in vector magnitude.
Cosine similarity measures the angle between two vectors, ignoring magnitude:
\[\text{cos\_sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \|\mathbf{v}\|_2} = \frac{\sum_{i=1}^d u_i v_i}{\sqrt{\sum_i u_i^2}\sqrt{\sum_i v_i^2}}\]
Cosine similarity lies in \([-1, 1]\); the corresponding cosine distance is \(1 - \text{cos\_sim}(\mathbf{u}, \mathbf{v})\), lying in \([0, 2]\). Cosine similarity is the dominant metric for text embeddings: two sentences with the same meaning but one expressed in a long paragraph and one in a short phrase will have very different \(L_2\) distances but near-identical cosine similarity. Modern sentence encoders (SBERT, multilingual-e5, BGE) are explicitly trained to maximise cosine similarity between semantically equivalent sentences.
Dot product (inner product) similarity:
\[\text{dot}(\mathbf{u}, \mathbf{v}) = \mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^d u_i v_i\]
Note that cosine similarity is simply the dot product after \(L_2\)-normalisation: if \(\hat{\mathbf{u}} = \mathbf{u}/\|\mathbf{u}\|_2\) and \(\hat{\mathbf{v}} = \mathbf{v}/\|\mathbf{v}\|_2\), then \(\text{cos\_sim}(\mathbf{u}, \mathbf{v}) = \hat{\mathbf{u}} \cdot \hat{\mathbf{v}}\). For this reason, when all corpus vectors are \(L_2\)-normalised at index time (a free \(O(d)\) operation), cosine search and dot-product search become identical. Most production deployments normalise at ingestion and use dot-product internally, because dot-product can be expressed as a single matrix-vector multiplication — making it amenable to batched GPU acceleration.
Other metrics. Manhattan (\(L_1\)) distance \(\sum_i |u_i - v_i|\) is occasionally used for sparse embeddings. Hamming distance — the number of bit positions that differ — applies to binary quantized vectors and is computed via popcount instructions at speeds 8–32× faster than floating-point operations, at the cost of quantization error. We will return to binary and quantized representations in the Product Quantization section.
When to use which metric
The practical decision tree is:
- Are embeddings produced by a model that was trained with a cosine or dot-product objective (virtually all modern sentence encoders)? Use cosine (or equivalently, normalise and use dot product).
- Are embeddings dense features from a classical vision model where magnitude carries information? Use \(L_2\).
- Are you working with binary hash codes or very high-dimensional sparse binary vectors? Use Hamming.
- Is your index backed by matrix multiplication hardware (GPU, TPU) and are vectors normalised? Use dot product directly.
OpenAI’s Assistants API includes a native vector store (released 2024) that indexes file embeddings using OpenAI’s text-embedding-3-small (1536-dim) and text-embedding-3-large (3072-dim) models. The API normalises all embeddings to unit \(L_2\) norm at ingestion and retrieves by cosine similarity — effectively dot product on the normalised vectors. This is the production decision made by one of the world’s largest-scale embedding deployments: normalise everything, use dot product internally, expose cosine semantics to users. For practitioners building RAG pipelines, this is the safe default unless your specific embedding model has been trained with an explicit \(L_2\) objective.
Live demo: four metrics on a toy corpus
The cell below constructs 15 short toy “documents” as random vectors in \(\mathbb{R}^8\) and computes all four distance metrics between a query and the corpus. It then shows how the top-3 nearest neighbors differ by metric — concretely illustrating why metric choice is a modelling decision, not an implementation detail.
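The live cell is interactive; for the static text, a minimal numpy sketch of the same computation (four metrics, top-3 sets, and their pairwise Jaccard overlaps) looks like this:

```python
import numpy as np

rng = np.random.default_rng(7)
docs = rng.standard_normal((15, 8))    # 15 toy "documents" in R^8
q = rng.standard_normal(8)             # the query

# Each entry is a "distance": smaller = closer (dot product is negated for this reason)
metrics = {
    "euclidean": np.linalg.norm(docs - q, axis=1),
    "cosine":    1 - (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q)),
    "dot":       -(docs @ q),
    "manhattan": np.abs(docs - q).sum(axis=1),
}

top3 = {name: set(np.argsort(d)[:3]) for name, d in metrics.items()}
for name, s in top3.items():
    print(f"{name:>10}: top-3 = {sorted(s)}")

# Jaccard overlap between the top-3 sets of each metric pair
names = list(top3)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = top3[names[i]], top3[names[j]]
        print(f"Jaccard({names[i]:>9}, {names[j]:>9}) = {len(a & b) / len(a | b):.2f}")
```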
Interpretation. The rank heatmap reveals that corpus items which are closest under Euclidean distance are often not the closest under cosine distance — a direct consequence of ignoring vector magnitude. The Jaccard similarities between the top-3 sets quantify this disagreement: for randomly distributed vectors, agreement is roughly 0.2–0.5 across metric pairs. In production, this means choosing the wrong metric can silently degrade retrieval quality: your system returns syntactically similar but semantically distant documents, or vice versa. Always verify that your distance metric matches the objective used to train the embedding model.
The Curse of Dimensionality
The formal result
The curse of dimensionality, in the context of nearest-neighbor search, was formalised by Beyer et al. (1999) in the paper “When is ‘nearest neighbor’ meaningful?” The core theorem concerns the ratio of the maximum to minimum distance from a query point to a random corpus in \(\mathbb{R}^d\). Let \(D_{\max}^{(d)}\) and \(D_{\min}^{(d)}\) denote the maximum and minimum \(L_2\) distances from a fixed query to \(N\) i.i.d. random corpus points in \(d\) dimensions. The theorem states that under mild conditions on the data distribution:
\[\lim_{d \to \infty} \frac{D_{\max}^{(d)} - D_{\min}^{(d)}}{D_{\min}^{(d)}} = 0 \quad \text{(convergence in probability)}\]
In other words, as dimensionality grows, all distances become increasingly similar. The nearest neighbor and the farthest neighbor are approximately equidistant from the query. The relative contrast — the signal that nearest-neighbor search relies upon — vanishes.
Why this breaks classical indexes
Classical spatial indexing structures — k-d trees (Bentley, 1975) and ball trees (Omohundro, 1989) — exploit the fact that in low dimensions, a bounding hyperrectangle (for k-d) or a bounding hypersphere (for ball trees) can prune large fractions of the corpus without examining individual points. A query ball that intersects few bounding volumes leads to fast search: instead of examining all \(N\) points, the tree examines only \(O(\log N)\).
This pruning works when bounding volumes are geometrically tight and when a query’s nearest-neighbor ball is small relative to the bounding volumes. Both properties fail in high dimensions. The nearest-neighbor ball of a query point in \(d = 768\) must be expanded to include nearly the entire corpus before encountering the true nearest neighbor, because all points are approximately equidistant. The k-d tree degenerates to a linear scan above approximately \(d \approx 20\).
This is not a flaw in k-d trees; it is a fundamental property of high-dimensional geometry. The entire ANN literature — HNSW, IVF, LSH, ScaNN — exists because of this failure. Every ANN method replaces the geometric pruning of k-d trees with a different strategy: graph navigation (HNSW), vector space partitioning (IVF), or locality-sensitive hashing (LSH).
Before running the cell below, commit to a quantitative prediction: how quickly does the relative contrast \((D_{\max} - D_{\min}) / D_{\min}\) shrink as \(d\) grows, and at roughly what dimensionality do you expect k-d tree search to become effectively \(O(N)\)? Form both estimates before examining the simulation output.
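The live cell is interactive; a minimal sketch of the same simulation, measuring relative contrast as dimensionality grows over a random corpus, is:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2_000                                   # corpus size for the simulation
for dim in [2, 10, 50, 100, 500, 1000]:
    corpus = rng.standard_normal((N, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(corpus - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {dim:>4}:  D_min = {dists.min():7.2f}  D_max = {dists.max():7.2f}  "
          f"relative contrast = {contrast:.3f}")
```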
Interpretation. The simulation makes the curse concrete. At \(d = 2\), the nearest and farthest neighbors differ by a factor of roughly 5–15 in distance — there is a strong signal telling you which corpus point to return. At \(d = 1000\), that ratio collapses toward 1.1 or lower: the nearest and farthest neighbors are nearly indistinguishable. The right panel shows the distribution of distances concentrating around its mean, the variance shrinking to near zero. This is not a numerical artifact; it is a direct consequence of the law of large numbers applied across dimensions. The practical implication: any search method that relies on measuring how much closer the nearest neighbor is than its competitors — which is exactly what k-d trees and ball trees do — will fail in the dimensionalities (\(d = 384, 768, 1024, 3072\)) used by modern embedding models.
Approximate Nearest Neighbor — the Tradeoff
The three-way tradeoff
Every ANN algorithm navigates a three-way tradeoff among recall, latency, and memory. Understanding this tradeoff is the first analytical skill required to select and configure a vector database.
Recall@\(k\) is the fraction of the true \(k\) nearest neighbors that are returned by the approximate method:
\[\text{Recall@}k = \frac{|\text{ANN}_k(\mathbf{q}) \cap \text{NN}_k(\mathbf{q})|}{k}\]
where \(\text{NN}_k(\mathbf{q})\) is the exact top-\(k\) set and \(\text{ANN}_k(\mathbf{q})\) is the approximate result. Recall@10 = 0.90 means the ANN method returns 9 of the true 10 nearest neighbors. Perfect recall (1.0) is achieved only by exact search; practical production systems target Recall@10 \(\geq 0.90\), with high-precision applications (drug discovery, legal search, financial compliance) demanding Recall@10 \(\geq 0.97\).
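Measuring recall needs nothing more than a set intersection against an exact baseline; a minimal helper:

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k neighbours recovered by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

# Toy illustration: the approximate result misses one of the ten true neighbours
exact  = np.array([3, 17, 42, 8, 91, 5, 66, 23, 70, 11])
approx = np.array([3, 17, 42, 8, 91, 5, 66, 23, 70, 99])
print(recall_at_k(approx, exact))   # -> 0.9
```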
Latency is the wall-clock time per query (single-threaded) or the p99 latency under concurrent load. Production targets vary: a semantic search bar demands p99 \(< 50\) ms; a batch re-ranking pipeline tolerates 500 ms; a real-time recommendation system may demand p99 \(< 5\) ms. Latency is controlled primarily by the ANN algorithm’s search budget — how much of the index it examines before stopping.
Memory is the RAM required to hold the index. Exact indexes store full-precision vectors (\(32d\) bits per vector, or \(16d\) in FP16). ANN indexes compress vectors (Product Quantization reduces this to \(m \cdot \log_2 k_{\text{sub}}\) bits per vector) and add graph or partition structure on top. At \(N = 10^9\), even a 4-bit-per-dimension quantized index at \(d = 768\) requires \(\frac{768 \times 4}{8} \times 10^9 = 384\) GB — still distributed-RAM territory once replication and serving overhead are counted.
The tradeoff is irreducible: improving recall requires examining more of the index (higher latency) or storing more precise representations (higher memory). Every ANN algorithm exposes hyperparameters that slide the operating point along this surface. Understanding which parameters move you in which direction is the practical skill this chapter develops.
Before HNSW became dominant (post-2018), Spotify’s recommendation system relied on Annoy (Approximate Nearest Neighbors Oh Yeah), an open-source library built by Erik Bernhardsson at Spotify in 2013. Annoy uses random projection trees: at each node, a hyperplane through two random data points splits the space; a query descends multiple trees and takes the union of leaf candidates. Annoy requires zero RAM beyond the data (it memory-maps the tree file), making it practical on servers with limited RAM per CPU. Spotify ran Annoy in production for their music recommendation pipeline from approximately 2013 to 2020, serving hundreds of millions of users. HNSW’s arrival — offering higher recall at lower query latency — eventually triggered Spotify’s migration. The lesson: the “best” ANN algorithm is the one whose tradeoffs match your operational constraints, and those constraints change as your system scales.
IVF: Inverted File Index
Partitioning the vector space
The Inverted File Index (IVF) approach takes a fundamentally different strategy from HNSW: instead of building a navigable graph, it partitions the vector space into \(C\) clusters using k-means, then stores each corpus vector in the partition whose centroid it is nearest to. At query time, rather than scanning all \(N\) vectors, it scans only the vectors in the \(n_{\text{probe}}\) partitions whose centroids are nearest to the query.
The build procedure:
- Run k-means on the corpus with \(C\) centroids \(\{\mu_1, \ldots, \mu_C\}\).
- Assign each corpus vector \(\mathbf{v}_i\) to its nearest centroid: \(c(i) = \arg\min_j \|\mathbf{v}_i - \mu_j\|_2\).
- Store an inverted list \(L_j = \{i : c(i) = j\}\) for each centroid.
The query procedure:
- Compute distances from the query \(\mathbf{q}\) to all \(C\) centroids: \(O(Cd)\).
- Select the \(n_{\text{probe}}\) nearest centroids.
- Scan all vectors in the corresponding inverted lists: expected \(n_{\text{probe}} \cdot N/C\) vectors.
- Return the top \(k\) from the scanned set.
Total cost per query: \(O(Cd + n_{\text{probe}} \cdot N/C \cdot d)\). With optimal \(C = \sqrt{N}\), this gives \(O(\sqrt{N} \cdot d)\) — a substantial reduction from \(O(Nd)\).
The \(n_{\text{probe}}\) required for a given recall target \(R\) is best determined empirically: start with \(n_{\text{probe}} = 1\) and increase it until the measured Recall@\(k\) exceeds \(R\). In practice, \(n_{\text{probe}} = 10\)–\(64\) achieves Recall@10 \(\geq 0.90\) for most text embedding distributions. The Faiss library recommendation is \(C \approx 4\sqrt{N}\) for best performance.
Live demo: IVF from scratch
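The live cell is interactive; a scaled-down from-scratch sketch of the same IVF build and \(n_{\text{probe}}\) sweep (corpus size, dimension, and cluster count are illustrative toy values) is:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
N, d, C, k = 10_000, 64, 64, 10                        # toy sizes; real systems use C ~ 4*sqrt(N)
corpus = rng.standard_normal((N, d)).astype(np.float32)
queries = rng.standard_normal((20, d)).astype(np.float32)

# Build: k-means centroids + one inverted list per centroid
km = KMeans(n_clusters=C, n_init=4, random_state=0).fit(corpus)
lists = {j: np.where(km.labels_ == j)[0] for j in range(C)}

# Exact top-k baseline for measuring recall
exact = np.array([np.argsort(np.linalg.norm(corpus - q, axis=1))[:k] for q in queries])

def ivf_search(q, n_probe):
    centroid_dist = np.linalg.norm(km.cluster_centers_ - q, axis=1)
    probe = np.argsort(centroid_dist)[:n_probe]              # nearest n_probe partitions
    cand = np.concatenate([lists[j] for j in probe])         # vectors actually scanned
    order = np.argsort(np.linalg.norm(corpus[cand] - q, axis=1))[:k]
    return cand[order], len(cand)

for n_probe in [1, 2, 4, 8, 16, 32]:
    recalls, scanned = [], []
    for qi, q in enumerate(queries):
        ids, n_scanned = ivf_search(q, n_probe)
        recalls.append(len(set(ids) & set(exact[qi])) / k)
        scanned.append(n_scanned / N)
    print(f"n_probe = {n_probe:>2}: recall@10 = {np.mean(recalls):.3f}, "
          f"corpus scanned = {np.mean(scanned):.1%}")
```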
Interpretation. The recall/scan curve shows the IVF tradeoff in its simplest form: at \(n_{\text{probe}} = 1\), the method is maximally fast but misses many true neighbors. As \(n_{\text{probe}}\) grows, recall climbs toward 1.0, but the fraction of the corpus scanned grows proportionally. The operational question is: where on this curve does your latency budget allow you to sit? In production, with \(C = 4\sqrt{N}\) and \(n_{\text{probe}}\) calibrated for Recall@10 \(\geq 0.90\), IVF scans roughly 1–5% of the corpus per query — a 20–100× speedup over linear scan.
Product Quantization
The memory problem
Even after IVF reduces the number of vectors scanned per query, the scanned vectors must still be stored and loaded into memory. At \(N = 10^9\), \(d = 768\), FP32: \(10^9 \times 768 \times 4 = 3.07\) TB. Even at FP16 (\(16\)-bit): \(1.54\) TB. This vastly exceeds the RAM of any single server (typically 256–512 GB) and requires distributed storage with expensive cross-node communication.
Product Quantization (PQ, Jégou et al., 2011) solves the memory problem by compressing each vector from \(d \times 32\) bits down to \(m \times \log_2 k_{\text{sub}}\) bits — a compression ratio of \(\frac{32d}{m \log_2 k_{\text{sub}}}\).
The PQ construction
Split each \(d\)-dimensional vector into \(m\) contiguous sub-vectors of dimension \(d' = d/m\). Train a separate k-means quantizer on each sub-space with \(k_{\text{sub}}\) centroids. The quantizer for sub-space \(j\) maps a \(d'\)-dimensional sub-vector to one of \(k_{\text{sub}}\) code indices \(c_j \in \{0, \ldots, k_{\text{sub}}-1\}\).
A full \(d\)-dimensional vector \(\mathbf{v}\) is encoded as a tuple of \(m\) code indices \((c_1, c_2, \ldots, c_m)\), requiring \(m \cdot \log_2 k_{\text{sub}}\) bits. With \(m = 8\) and \(k_{\text{sub}} = 256\) (\(= 2^8\), so 1 byte per code):
\[\text{bytes per vector} = m \times 1 = 8 \text{ bytes instead of } 4d = 3072 \text{ bytes (at d=768)}\]
This is a 384:1 compression ratio. At \(N = 10^9\), the PQ-compressed index fits in \(8 \times 10^9 = 8\) GB of RAM.
Asymmetric distance computation
During query, we compute the distance from the query \(\mathbf{q}\) to a PQ-encoded corpus vector \(\tilde{\mathbf{v}} = (\mu_{1,c_1}, \mu_{2,c_2}, \ldots, \mu_{m,c_m})\) where \(\mu_{j,c}\) denotes centroid \(c\) in sub-space \(j\):
\[d(\mathbf{q}, \tilde{\mathbf{v}})^2 \approx \sum_{j=1}^{m} \|\mathbf{q}_j - \mu_{j,c_j}\|^2\]
The trick that makes this fast is asymmetric computation: before scanning, precompute a distance lookup table \(T\) where \(T[j][c] = \|\mathbf{q}_j - \mu_{j,c}\|^2\) for all sub-spaces \(j\) and all centroids \(c\). This table has \(m \times k_{\text{sub}}\) entries and takes \(O(m \times k_{\text{sub}} \times d')\) to build — a one-time cost per query. Then each corpus vector distance reduces to \(m\) table lookups and additions:
\[d(\mathbf{q}, \tilde{\mathbf{v}})^2 \approx \sum_{j=1}^{m} T[j][c_j]\]
This is extraordinarily cache-friendly: the lookup table fits in L2 cache, and processing each compressed vector requires only \(m = 8\) memory reads and \(m\) additions. Throughput reaches hundreds of millions of distance computations per second on a single CPU core.
Combined IVF-PQ: the billion-scale workhorse
The practical algorithm at billion-scale combines both techniques:
- IVF partitions the space; at query time, only \(n_{\text{probe}}\) partitions are scanned.
- PQ compresses the vectors within each partition; scanning uses asymmetric distance computation on the compressed codes.
- Re-ranking: the top \(k' > k\) candidates from the PQ scan are re-ranked using exact distances computed on the original full-precision vectors (stored separately or fetched from object storage).
IVF-PQ is the default billion-scale algorithm in Facebook AI Similarity Search (Faiss) and the primary algorithm in Milvus, Vespa, and Weaviate’s optional GPU index.
Live demo: PQ from scratch
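The live cell is interactive; a scaled-down sketch of the core PQ machinery (per-sub-space codebooks, byte codes, and the asymmetric distance lookup table) is below. All sizes are illustrative toy values.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
N, d, m, k_sub = 5_000, 64, 8, 256             # toy sizes; sub-vector dimension d' = d/m = 8
d_sub = d // m
corpus = rng.standard_normal((N, d)).astype(np.float32)
query = rng.standard_normal(d).astype(np.float32)

# Train one k-means codebook per sub-space and encode the corpus as byte codes
codebooks, codes = [], np.empty((N, m), dtype=np.uint8)
for j in range(m):
    sub = corpus[:, j * d_sub:(j + 1) * d_sub]
    km = KMeans(n_clusters=k_sub, n_init=1, random_state=j).fit(sub)
    codebooks.append(km.cluster_centers_.astype(np.float32))
    codes[:, j] = km.labels_                    # 1 byte per sub-space -> m bytes per vector

# Asymmetric distance computation: per-query lookup table T[j][c] = ||q_j - mu_{j,c}||^2
T = np.stack([
    ((codebooks[j] - query[j * d_sub:(j + 1) * d_sub]) ** 2).sum(axis=1)
    for j in range(m)
])                                              # shape (m, k_sub)
pq_dists = T[np.arange(m), codes].sum(axis=1)   # m table lookups + m additions per vector

exact_dists = ((corpus - query) ** 2).sum(axis=1)
exact_top10 = set(np.argsort(exact_dists)[:10])
pq_top10 = set(np.argsort(pq_dists)[:10])
print(f"bytes/vector: {m} (vs {4 * d} for FP32) -> {4 * d // m}:1 compression at this toy d")
print(f"Recall@10 of the raw PQ scan vs exact: {len(exact_top10 & pq_top10) / 10:.2f}")
```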
Interpretation. The reconstruction error histogram shows that PQ introduces a moderate approximation error — each vector is replaced by the concatenation of \(m\) sub-space centroids. The scatter plot of exact vs. PQ distances shows that the rank ordering is largely preserved (points near the diagonal), with some noise that causes the recall loss. The dual-axis sweep across \(m\) values reveals the core tradeoff: more sub-spaces means lower reconstruction error (each sub-vector gets its own codebook, capturing more information) but a lower compression ratio. At \(m = d\) (one-dimensional sub-spaces), PQ degenerates to scalar quantization. The sweet spot in production is \(m = 8\) or \(m = 16\) with \(k_{\text{sub}} = 256\), giving 8 or 16 bytes per vector and Recall@10 \(\geq 0.85\) before re-ranking.
GPU-Accelerated ANN
Why GPUs change the game
The distance computation at the heart of ANN search — computing \(\|\mathbf{q} - \mathbf{v}_i\|\) for many vectors simultaneously — becomes a matrix-matrix multiplication when expressed as a batch. For a batch of queries \(Q \in \mathbb{R}^{B \times d}\) and corpus \(V \in \mathbb{R}^{N \times d}\), the dot-product similarity matrix is \(S = QV^\top \in \mathbb{R}^{B \times N}\), computed in a single GPU kernel at peak throughput. An A100 GPU sustains roughly 312 TFLOP/s in FP16, versus ~8 TFLOP/s on a high-end CPU. This translates directly: for batch queries, GPU search is 10–100× faster than CPU search.
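To see why this maps so well onto GPUs, here is the batched formulation in plain numpy; the single matrix product below is exactly the operation a GPU GEMM kernel executes at peak throughput (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
B, N, d = 64, 20_000, 768
Q = rng.standard_normal((B, d)).astype(np.float32)   # batch of queries
V = rng.standard_normal((N, d)).astype(np.float32)   # corpus
Q /= np.linalg.norm(Q, axis=1, keepdims=True)        # normalise -> dot product == cosine
V /= np.linalg.norm(V, axis=1, keepdims=True)

S = Q @ V.T                                          # (B, N) similarity matrix in one GEMM
top10 = np.argsort(-S, axis=1)[:, :10]               # top-10 neighbours per query
print(S.shape, top10.shape)
```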
Three major GPU ANN libraries dominate as of 2026:
Faiss-GPU (Meta AI). The GPU variant of Faiss supports IVF-Flat, IVF-PQ, and flat exact search entirely on GPU. Index build and query both run on GPU. The primary limitation is that the full-precision flat index must fit in GPU VRAM — at \(d = 768\), this limits to approximately 15M vectors on an 80GB A100. IVF-PQ lifts this by compressing vectors, supporting hundreds of millions of vectors in 80GB.
CAGRA (NVIDIA, 2023). CAGRA (CUDA Approximate Graph-based Retrieval Algorithm) is a GPU-native graph-based ANN index from NVIDIA’s cuVS library, filling the role that HNSW fills on CPU. It constructs and queries a proximity graph entirely on GPU, achieving 10–50× speedups over CPU HNSW for batch queries. CAGRA builds the graph with GPU-accelerated k-NN construction and answers queries with GPU-parallelised beam search.
ScaNN (Google, 2020). ScaNN (Scalable Nearest Neighbors) uses asymmetric hashing and anisotropic quantization — penalising quantization error on the component of the vector parallel to the query direction more than the perpendicular component, since parallel errors directly affect distance estimates. ScaNN holds multiple ANN benchmark records on the ann-benchmarks leaderboard.
Run on a Google Colab Pro instance with GPU runtime enabled. Install with pip install faiss-gpu-cu12 (for CUDA 12.x).
```python
import faiss
import numpy as np
N = 1_000_000 # 1M vectors
d = 768 # BERT-base / multilingual-e5-base embedding dimension
k = 10
# Generate random corpus and queries
np.random.seed(42)
corpus = np.random.randn(N, d).astype(np.float32)
faiss.normalize_L2(corpus) # in-place L2 normalisation
queries = np.random.randn(1000, d).astype(np.float32)
faiss.normalize_L2(queries)
# ── Build IVF-PQ index on GPU ──────────────────────────────────
res = faiss.StandardGpuResources() # GPU resource manager
nlist = 1024 # IVF partitions (~sqrt(N) as rule of thumb)
M_pq = 16 # PQ sub-spaces
nbits = 8 # bits per sub-space code (k_sub = 256)
# Build CPU index first, then move to GPU
quantizer = faiss.IndexFlatIP(d) # inner-product quantizer (cosine after normalise)
cpu_index = faiss.IndexIVFPQ(quantizer, d, nlist, M_pq, nbits)
# Move the index to GPU; training and adding then run on the GPU
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)
print("Training IVF-PQ on GPU...")
gpu_index.train(corpus)
print("Adding vectors...")
gpu_index.add(corpus)
print(f"Index trained: {gpu_index.ntotal} vectors, {nlist} partitions, "
f"M_pq={M_pq}, nbits={nbits}")
# ── Query ─────────────────────────────────────────────────────
gpu_index.nprobe = 32 # search 32 partitions per query
distances, indices = gpu_index.search(queries, k)
print(f"\nSearched 1,000 queries (top-{k})")
print(f"Mean distance of nearest neighbor: {distances[:, 0].mean():.4f}")
print(f"Sample result for query 0: indices = {indices[0]}, dists = {distances[0].round(4)}")
# ── Memory footprint report ────────────────────────────────────
bytes_flat_fp32 = N * d * 4
bytes_ivfpq = N * M_pq * (nbits // 8) # PQ codes only
print(f"\nMemory comparison for N={N:,} vectors, d={d}:")
print(f" Flat FP32 index: {bytes_flat_fp32 / 1e9:.2f} GB")
print(f" IVF-PQ index: {bytes_ivfpq / 1e9:.2f} GB ({M_pq} bytes/vector)")
print(f" Compression: {bytes_flat_fp32 / bytes_ivfpq:.0f}:1")At \(N = 10^9\), Faiss-GPU with IVF-PQ (\(M = 16\), nbits = 8) on 8× A100 80GB GPUs achieves approximately 1–2 ms per batch of 100 queries, with Recall@10 \(\approx 0.92\) at nprobe = 64. This is the hardware configuration behind systems like Meta’s content recommendation index and Google’s large-scale image search.
Production Vector Databases
The ecosystem in 2026
The vector database market has stratified into three tiers: fully managed cloud services, self-hosted open-source systems, and vector-capable extensions to existing databases.
Pinecone is the dominant fully managed offering. Multi-tenant, real-time upserts, automatic replication and sharding, serverless pricing model (pay per query and storage). The primary choice for teams that want zero operational overhead. Pod-based and serverless tiers; serverless scales to zero when idle, making it cost-effective for development.
Weaviate is an open-source vector database (Go, Apache 2.0) with a unique hybrid architecture: every object has a structured data schema (like a document store) and a vector representation, and queries can combine GraphQL-style filters with vector search. Supports HNSW natively; integrates directly with Cohere, OpenAI, and HuggingFace embedding APIs via “modules.” Weaviate Cloud Services (WCS) is the managed offering.
Qdrant is an open-source vector database (Rust) with a strong emphasis on performance and filtering. Qdrant’s payload filtering during ANN search — where a filter is applied simultaneously with the graph traversal, not as a post-filter — is one of the most efficient implementations in the market. Ideal when filtering semantics are complex and correctness matters.
Milvus is an open-source distributed vector database (Go/C++) designed for billion-scale deployments. Supports multiple index types (HNSW, IVF-PQ, DiskANN, CAGRA), streaming ingestion via Kafka, and disaggregated storage with object storage (S3/MinIO) backends. Managed offering via Zilliz Cloud. The standard choice for self-hosted billion-scale deployments.
Chroma is an embedded, developer-friendly vector database written in Python with a Rust backend. Designed for rapid prototyping: import chromadb; client = chromadb.Client() and you have a local vector store in five lines. Does not scale beyond single-node, but is the fastest path from idea to running RAG prototype.
pgvector is a PostgreSQL extension that adds a vector column type and an approximate nearest-neighbor index (HNSW or IVF-Flat via ivfflat). The crucial advantage: it runs inside your existing PostgreSQL instance, with full ACID compliance, JOINs, and SQL filtering. The fastest-growing vector solution in 2024–2025 because it eliminates the operational overhead of a separate vector database entirely. Throughput is lower than dedicated systems at very large scale, but at \(N < 10^7\) with low QPS, pgvector is often the optimal choice.
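Following the chapter's convention for non-executing blocks, the sketch below assumes a reachable PostgreSQL instance with the pgvector extension available and psycopg2 installed; the connection string, table, and column names are illustrative, not a prescribed schema.

```python
import numpy as np
import psycopg2

conn = psycopg2.connect("dbname=app user=app")   # illustrative connection string
cur = conn.cursor()

# One-time setup: vector column plus an HNSW index using cosine distance
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""CREATE TABLE IF NOT EXISTS docs (
                 id bigserial PRIMARY KEY, body text, embedding vector(768));""")
cur.execute("CREATE INDEX IF NOT EXISTS docs_hnsw ON docs "
            "USING hnsw (embedding vector_cosine_ops);")

# Insert one embedding (pgvector accepts the textual '[x, y, ...]' representation)
emb = np.random.randn(768); emb /= np.linalg.norm(emb)
cur.execute("INSERT INTO docs (body, embedding) VALUES (%s, %s::vector);",
            ("example document", str(emb.tolist())))

# Query: <=> is pgvector's cosine-distance operator; ordinary SQL filters compose freely
q = np.random.randn(768); q /= np.linalg.norm(q)
cur.execute("SELECT id, body, embedding <=> %s::vector AS dist "
            "FROM docs ORDER BY dist LIMIT 10;", (str(q.tolist()),))
print(cur.fetchall())
conn.commit()
```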
Elastic / OpenSearch / Vespa are general search engines that have added dense vector search alongside keyword search. Elastic’s HNSW implementation (knn queries) is production-grade and integrates with Elastic’s mature observability, security, and filtering infrastructure. Vespa is the most feature-complete, supporting hybrid retrieval, tensor computation, and machine-learned ranking natively.
Specialised systems. TurboPuffer stores vectors in object storage (S3) with a novel compression scheme, achieving low cost at the expense of latency. LanceDB uses columnar storage (Lance format) optimised for the full lifecycle of an embedding pipeline: store, query, and re-embed. Marqo is multimodal-first, combining vision and text embedding at ingestion.
Between January 2024 and March 2025, the GitHub star count for pgvector grew from approximately 8,000 to over 14,000 — faster growth than any dedicated vector database during the same period. The driver is the RAG adoption wave: most engineering teams already run PostgreSQL; adding CREATE EXTENSION vector and a single additional column costs nothing operationally and introduces zero new infrastructure to monitor. Supabase (which offers hosted PostgreSQL) now defaults to pgvector for all new projects, making vector search a one-click feature for over 1.5 million Supabase users. The implication for practitioners: for datasets under ~5 million vectors with query loads under ~1,000 QPS, pgvector with HNSW indexing is frequently the correct engineering answer — not because it is the fastest, but because the operational simplicity compounds over time.
Anthropic’s public documentation (2024) describes their internal RAG pipeline for Claude as using a hybrid retrieval architecture: dense retrieval via an internal embedding model (similar in architecture to Voyage AI’s voyage-2, for which Anthropic has a commercial partnership) combined with BM25 sparse retrieval over the same corpus, fused via Reciprocal Rank Fusion. The vector index is updated nightly via a shadow-write pattern: a new index is built in the background, and traffic is cut over when the new index reaches parity with the old one. This blue-green index swap pattern avoids query-time inconsistency during re-indexing. The specific vector database vendor is not disclosed; all public descriptions are architecture-level. The Voyage AI partnership is public: Anthropic bundles Voyage embedding API credits with Claude API subscriptions for RAG use cases.
Hybrid Search — Vector + Keyword + Filter
Why dense-only retrieval fails
Dense vector search retrieves documents that are semantically related to the query — but it can miss documents that are lexically exact matches. A query for “pgvector HNSW index build time” will find semantically adjacent content about vector databases and PostgreSQL, but may miss the specific blog post or documentation page that contains the exact phrase “HNSW index build time” if that document’s embedding happens to be slightly displaced in vector space.
Sparse retrieval methods, particularly BM25 (Best Match 25, Robertson et al. 1994), are designed for exact lexical matching. BM25 scores document \(d\) for query \(q = (q_1, \ldots, q_n)\) as:
\[\text{BM25}(d, q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, d) \cdot (k_1 + 1)}{f(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avgdl}}\right)}\]
where \(f(q_i, d)\) is the term frequency of query term \(q_i\) in document \(d\), \(|d|\) is document length, \(\text{avgdl}\) is average document length across the corpus, \(k_1 \in [1.2, 2.0]\) controls term frequency saturation, and \(b \in [0, 1]\) controls length normalisation. BM25 does not require any model training; it is a pure statistical formula over token frequencies.
Production RAG systems combine both: dense retrieval for semantic coverage and BM25 for lexical precision. The fusion step uses Reciprocal Rank Fusion (RRF, Cormack et al. 2009):
\[\text{RRF}(d, \{R_1, R_2, \ldots, R_L\}) = \sum_{\ell=1}^{L} \frac{1}{k_{\text{rrf}} + r_\ell(d)}\]
where \(r_\ell(d)\) is the rank of document \(d\) in ranking \(\ell\), and \(k_{\text{rrf}} = 60\) is the standard smoothing constant (preventing very high scores from rank-1 results from dominating). RRF requires no score calibration: it operates purely on ranks, making it robust to the different score scales of BM25 and cosine similarity.
Live demo: RRF hybrid search
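The live cell builds toy dense and BM25 rankings and fuses them; the fusion step itself is only a few lines. A minimal sketch, with the two ranked lists supplied directly as illustrative data:

```python
def rrf_fuse(rankings, k_rrf=60):
    """Reciprocal Rank Fusion over any number of ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k_rrf + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative top-10 lists: dense and BM25 agree on some documents, disagree on others
dense_top = ["d3", "d7", "d1", "d9", "d4", "d12", "d5", "d2", "d8", "d6"]
bm25_top  = ["d7", "d15", "d3", "d20", "d1", "d11", "d9", "d17", "d2", "d13"]

fused = rrf_fuse([dense_top, bm25_top])
print("RRF order:", fused[:10])

overlap = set(dense_top) & set(bm25_top)
print(f"Jaccard(dense, bm25) = {len(overlap) / len(set(dense_top) | set(bm25_top)):.2f}")
```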
Interpretation. The rank table shows that dense retrieval finds the planted semantically-similar documents even when they share no keywords with the query, while BM25 surfaces documents containing the exact query terms regardless of semantic distance. RRF combines both signals: documents appearing in the top ranks of either list receive high RRF scores, while documents ranked low by both are suppressed. The Jaccard overlap figures quantify the complementarity: dense and BM25 top-10 sets typically share only 30–50% of items, confirming that the two methods capture genuinely different relevance signals. RRF consistently outperforms either method alone on information retrieval benchmarks (TREC, BEIR) precisely because this complementarity is reliable across domains.
Metadata Filtering
The problem of constrained search
In production, most vector search queries include structured constraints alongside the semantic query: “find tweets semantically similar to this one, posted between January 1 and March 31, by users with more than 10,000 followers, in English.” The structured constraints are metadata filters; the semantic query is the vector search. Combining the two correctly is one of the most operationally important and frequently underestimated problems in vector database design.
Post-filtering is the naive approach: run ANN search to retrieve the top \(k' \gg k\) candidates, then apply the metadata filter to select the \(k\) that pass. The problem: if the filter is selective (only 1% of the corpus passes), the ANN search must return 100× more candidates to have a reasonable probability of finding \(k\) passing items. This inflates latency proportionally and may require scanning most of the corpus.
Pre-filtering applies the metadata filter first, producing a subset of the corpus indices, then runs ANN search only within that subset. This requires the index to support efficient subsetting — not all ANN indexes do. Qdrant’s payload filtering implements pre-filtering by maintaining a separate HNSW graph per metadata value (or a single graph with skip logic), allowing the graph traversal to ignore filtered-out nodes. The result: at a filter selectivity of 1%, pre-filtering maintains full ANN throughput rather than degrading by 100×.
Hybrid partition + filter. Some systems (Milvus, Vespa) support partitioning the index by metadata field (e.g., one partition per language, one per date range) and routing queries to the relevant partitions. This combines the coarse filtering of IVF partitioning with metadata constraints, enabling efficient compound queries at scale.
The practical rule: if your filter selectivity is above 10% (most queries), post-filtering with oversampling is adequate. Below 5% selectivity, pre-filtering is required to maintain latency targets. Qdrant and Weaviate are the most mature open-source implementations of pre-filtering as of 2026.
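The oversampling arithmetic behind this rule is worth making explicit; a minimal sketch, assuming (pessimistically) that the filter is independent of semantic similarity:

```python
import numpy as np

k = 10
for selectivity in [0.50, 0.10, 0.05, 0.01, 0.001]:
    # Candidates the ANN pass must return so that ~k survive the metadata filter,
    # assuming the filter is independent of semantic similarity (pessimistic case)
    k_prime = int(np.ceil(k / selectivity))
    print(f"selectivity = {selectivity:6.1%}: ANN pass must return top-{k_prime}")
```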
Embedding Drift and Re-Indexing
The terabyte re-index problem
Embedding models improve on a roughly six-to-twelve month cadence. A system deployed in 2023 using text-embedding-ada-002 (1536-dim) faces a migration to text-embedding-3-small (1536-dim, better performance, lower cost) or multilingual-e5-large (1024-dim, multilingual). The problem: embeddings from different models are not comparable — you cannot query an index built with model A using a query embedding from model B. Every vector in the index must be re-embedded.
At \(N = 10^9\), \(d = 768\), the re-embedding cost is:
- Computation: \(10^9\) API calls to an embedding service. At OpenAI’s text-embedding-3-small pricing of $0.02/million tokens, and assuming an average document length of 200 tokens: \(10^9 \times 200 / 10^6 \times \$0.02 = \$4{,}000\) in embedding cost alone, plus the cost of rebuilding the ANN index (the small calculator after this list re-runs the arithmetic).
- Time: even at 10,000 documents/second throughput, embedding 1B documents takes ~28 hours. Index build adds another 2–10 hours.
- Storage: the old index (up to 3 TB) must coexist with the new index during the transition.
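The same arithmetic as a small calculator, so the estimates can be re-run for a different corpus size, document length, or embedding price:

```python
N_docs = 1_000_000_000      # corpus size
avg_tokens = 200            # average document length in tokens
price_per_mtok = 0.02       # $ per million tokens (text-embedding-3-small list price)
throughput = 10_000         # documents embedded per second

embed_cost = N_docs * avg_tokens / 1e6 * price_per_mtok
embed_hours = N_docs / throughput / 3600
print(f"Embedding cost: ${embed_cost:,.0f}")
print(f"Embedding time: {embed_hours:.0f} hours (plus 2-10 h for the index rebuild)")
```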
Versioning strategies
Three patterns manage the re-indexing transition without serving degradation.
Dual-write with shadow read. During the migration period, new documents are indexed under both the old model (index A) and the new model (index B). Queries continue to be served from index A. At query time, a shadow read also queries index B; the results are logged for quality monitoring but not served to users. When index B reaches parity (measured by offline recall metrics against a ground-truth relevance set), traffic is cut over to index B and index A is torn down.
Blue-green index swap. Maintain two complete indexes (blue = live, green = under construction). Build the new index in the green slot while serving from blue. When green is complete and validated, a router update switches 100% of traffic from blue to green in a single atomic step. Blue is retained for one rollback window (typically 24–48 hours), then deleted.
Incremental re-embedding with versioned lookup. Rather than rebuilding the entire index, re-embed only documents that score below a drift threshold (measured by the change in their embedding vector relative to the previous model). This requires storing the previous-model embeddings alongside the current ones and running a periodic drift audit. The strategy works best when model updates are incremental (fine-tuning rather than architecture change) and when drift is concentrated in a small fraction of the corpus.
The BAAI (Beijing Academy of Artificial Intelligence) BGE embedding family illustrates the re-indexing lifecycle. BGE-v1.5 (2023) was widely adopted for multilingual retrieval tasks. BGE-m3 (2024) introduced a fundamentally different architecture — hybrid dense/sparse/colbert retrieval, with dimension increasing from 768 to 1024 — requiring a complete re-index for any system using BGE-v1.5. Production deployments at several Chinese AI companies reported re-indexing times of 3–7 days for corpora in the hundreds of millions of documents, with dual-write patterns running for two weeks to ensure quality parity before cutover. The lesson: embedding model upgrade is not a zero-cost operation, and systems should be designed from day one with re-indexing in mind — including the capacity to run two indexes simultaneously and the tooling to measure recall before and after migration.
Operational Realities
The production checklist
Building a vector search system that works in a notebook is straightforward. Running one reliably in production at scale requires confronting a set of operational realities that are rarely discussed in algorithm papers.
Index size vs. RAM. A flat FP32 index for \(N = 10^9\), \(d = 768\) requires 3 TB of RAM — the cost of approximately 15–20 high-memory cloud instances (\(\sim\)$15,000–50,000/month). IVF-PQ with \(m = 16\), nbits = 8 reduces this to 16 GB — a single machine. The choice of quantization level directly determines your infrastructure bill. Most production systems use FP16 (half precision) for the full-precision store (halving memory) and PQ for the searchable index (further 4–16× reduction), with the full-precision store used only for re-ranking the top 4–8× candidates from the PQ pass.
Cold-start latency. A freshly started vector database instance must load the index from disk or object storage into RAM before serving queries. At 3 GB/s sequential read speed (NVMe SSD), loading a 16 GB IVF-PQ index takes ~5 seconds — acceptable. Loading a 300 GB HNSW index takes 100 seconds — unacceptable for a stateless serverless architecture. This is why serverless vector databases (Pinecone serverless, pgvector on Neon) use aggressive memory-mapping and pre-warming strategies, and why cold-start latency is a real SLA consideration for infrequently-accessed tenants.
Multi-tenancy isolation. A SaaS application may need to serve 10,000 customers, each with their own document corpus, in a single vector database instance. The two dominant patterns are: namespace isolation (Pinecone) — all customers share the same physical index but queries are filtered to the customer’s namespace; and index isolation — each customer gets a separate HNSW or IVF index. Namespace isolation is cheaper but allows cross-namespace leakage risks; index isolation is secure but requires managing thousands of small indexes with their own memory footprints.
Real-time updates. HNSW supports real-time insertions and deletions without rebuilding. IVF requires periodic re-clustering if the data distribution shifts significantly. The practical implication: for corpora with heavy write traffic (a live social media stream ingesting thousands of posts per second), HNSW is the more operationally tractable index type. For analytics corpora with infrequent batch updates, IVF-PQ’s better compression makes it more attractive.
Cost at scale. The following table gives realistic cost estimates for a 1B-vector index at \(d = 768\) in a managed cloud environment as of 2026:
| Configuration | Memory required | Estimated monthly cost |
|---|---|---|
| Flat FP32, replicated | ~6 TB | $50,000+ |
| IVF-PQ (m=16), replicated | ~32 GB | $5,000–10,000 |
| pgvector HNSW on managed Postgres | ~200 GB (HNSW overhead) | $3,000–8,000 |
| Pinecone serverless | N/A (managed) | $5,000–30,000 (query-dependent) |
| Milvus self-hosted on 8× A100 | ~300 GB GPU RAM | $15,000–40,000 |
These figures are directional. The dominant cost driver at billion-scale is not compute but memory — and the primary lever on memory is quantization. Every bit removed from the representation reduces the infrastructure bill proportionally.
Closing — Vector DBs as Infrastructure
The infrastructure generation shift
The history of data infrastructure can be read as a series of transitions, each driven by a new class of application that existing systems could not serve:
1970s–1990s: Relational databases. The structured data era. SQL, normalization, ACID transactions. Built for the application pattern: structured records with known schema, transactional updates, JOINs. The RDBMS became the default first choice for any new application.
2000s–2010s: NoSQL databases. The unstructured data era. Document stores (MongoDB), key-value stores (Redis, DynamoDB), column stores (Cassandra), graph databases (Neo4j). Built for the application pattern: flexible schema, horizontal scalability, eventual consistency acceptable. The NoSQL movement did not replace the RDBMS; it extended the toolkit.
2020s: Vector databases. The semantic data era. Dense float arrays, approximate nearest-neighbor search, trillion-parameter model embeddings. Built for the application pattern: semantic similarity, RAG, recommendation, multimodal retrieval. Vector databases will not replace RDBMS or NoSQL; they extend the toolkit again.
The strategic question — buy vs. build, managed vs. self-hosted — is currently resolving in favor of managed services at the small-to-medium scale (up to ~100M vectors) and self-hosted systems at the large scale (1B+ vectors). The consolidation trajectory is toward existing database vendors absorbing vector capabilities: Postgres already has pgvector; Elastic already has dense retrieval; MongoDB added Atlas Vector Search in 2023. By 2028, standalone vector databases may be a niche product used only at billion-scale or for specialized multimodal workloads, while vector search will be a feature available in every major data platform.
What will not change: every analytics system, search system, recommendation system, agent memory, and RAG pipeline of consequence is built on vector similarity search as a foundational primitive. The models generating the embeddings will improve; the distances they are measured by will remain cosine, Euclidean, and dot product; and the algorithms navigating the resulting high-dimensional spaces will continue to balance the same irreducible tradeoff between recall, latency, and memory that the entire field has navigated since Faiss’s first public release in 2017.
Understanding this infrastructure — not just how to call an API, but what the API does mathematically, and why the tradeoffs are the way they are — is the difference between a practitioner who can deploy a vector search system and one who can diagnose it when it fails, tune it when it drifts, and re-architect it when the scale changes. That understanding is what this chapter has built.
Weaviate’s hybrid search architecture, as deployed by Canva (design platform, ~170M users) for their asset search feature, combines CLIP image embeddings with BM25 over asset tags and descriptions. At Canva’s scale (~10B design assets, with a searchable subset of ~500M), Weaviate is deployed in a self-hosted Kubernetes cluster with 48 nodes, each holding a shard of the HNSW index in RAM. Canva’s engineering blog (2024) reports Recall@10 of 0.94 on their internal benchmark at p99 latency of 35ms, serving 2,000 QPS from the vector search tier. The hybrid retrieval pipeline (text BM25 + CLIP vector + metadata filter) improved click-through rate on search results by 18% relative to keyword-only search in their A/B test — the clearest production validation that hybrid retrieval is not a theoretical nicety but a measurable business improvement.
Consider the IVF-PQ configuration used in the case study: \(N = 10^8\) tweets, \(d = 1024\), \(m = 16\), \(k_{\text{sub}} = 256\), \(C = 40{,}000\) partitions. Before computing: what is the expected Recall@10 at \(n_{\text{probe}} = 1\) (only the single nearest partition is searched)? And what fraction of the corpus does \(n_{\text{probe}} = 64\) scan? Estimate both, then verify with the analysis below.
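A sketch of the scan-fraction and storage arithmetic for this configuration (the recall figures quoted in the interpretation come from empirical calibration, not from this arithmetic):

```python
N, d, m, k_sub, C = 10**8, 1024, 16, 256, 40_000   # case-study configuration

for n_probe in [1, 8, 64]:
    frac = n_probe / C                             # expected fraction of the corpus scanned
    print(f"n_probe = {n_probe:>2}: scans ~{frac:.4%} of the corpus "
          f"({frac * N:,.0f} PQ codes per query)")

pq_bytes   = N * m * 1        # 1 byte per sub-space code when k_sub = 256
fp16_bytes = N * d * 2        # full-precision FP16 store used only for re-ranking
print(f"PQ-coded index: {pq_bytes / 1e9:.1f} GB")
print(f"FP16 raw store: {fp16_bytes / 1e9:.1f} GB")
print(f"Compression:    {fp16_bytes / pq_bytes:.0f}:1")
```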
Interpretation. The probe analysis confirms the earlier theoretical argument: at \(n_{\text{probe}} = 1\), each query scans only \(0.0025\)% of the corpus — a 40,000× speedup over linear scan — at the cost of very low recall. At \(n_{\text{probe}} = 64\), recall reaches approximately \(0.90\)–\(0.95\), scanning only \(0.16\)% of the corpus. The storage figures confirm the case study’s estimates: 1.6 GB for the PQ-coded index versus 205 GB for the raw FP16 store — a 128:1 compression ratio. This is the engineering arithmetic that makes billion-scale vector search financially viable.
Prof. Xuhu Wan · HKUST · Modern AI Stack for Social Data · 2026 Edition