Vector Database Internals — The Algorithms Behind Approximate Nearest Neighbor Search

The rise of large language models (LLMs) turned the humble "vector database" into a core piece of modern infrastructure: it powers semantic search, recommendation, and Retrieval-Augmented Generation (RAG). Underneath the REST API, however, a vector database is solving a decades-old problem — Approximate Nearest Neighbor Search (ANN) in high-dimensional spaces — with algorithms that mix combinatorial data structures, quantization, and careful systems engineering.

This article builds up the core algorithms layer by layer, starting from brute force, climbing through IVF (Inverted File), Product Quantization (PQ), HNSW (Hierarchical Navigable Small World), and finally DiskANN (Vamana) — with references grounded in real code from FAISS, Milvus, pgvector, and Qdrant.

A mental model to carry through
Picture a vast library.

Brute force is a librarian picking up every book to check how similar it is — hopeless at a billion books.

IVF is a librarian who first glances at the shelf labels, walks only to the 10 shelves whose labels are closest to the book you're looking for, and searches there.

Product Quantization summarises each book's cover into 64 colour codes chosen from a 256-colour palette, then compares colour codes instead of covers — blazingly fast, slightly lossy.

HNSW is a social graph of books linked to "books I'm friends with," navigated top-down from sparse world map to dense street map.

DiskANN is the same graph but living on SSD. Only low-resolution thumbnails of every book (PQ-compressed vectors) stay in RAM; the full originals are pulled from disk only when a candidate looks promising.

The rest of the article is the careful version of this story.

1. Why vector search is hard

1.1 Embedding vectors

An embedding maps text / image / audio into a fixed-dimensional real vector such that semantic similarity corresponds to geometric closeness. Typical dimensionalities:

Model	Dimension	Preferred distance
OpenAI `text-embedding-3-small`	1536	cosine
OpenAI `text-embedding-3-large`	3072	cosine
Cohere `embed-v3`	1024	cosine / dot
BGE-M3	1024	cosine
CLIP ViT-L/14	768	cosine

Dimension $d$ is commonly above 1000, and at that scale brute-force search stops being practical.

1.2 Distances and similarities

For vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$ the three common metrics are:

Euclidean (L2):

d_{L2}(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2}

Inner product (IP; sum of element-wise products of the two vectors):

\text{IP}(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} x_i y_i

Cosine:

\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}

When every vector is normalised to unit length,

d_{L2}(\mathbf{x}, \mathbf{y})^2 = 2 - 2\,\mathbf{x}\cdot\mathbf{y}

so L2, cosine, and inner product become equivalent up to monotonic transformations (transformations that preserve order, so the ranking of results does not change). Most ANN libraries internally use L2 or IP, with optional pre-normalisation.

1.3 The curse of dimensionality

High-dimensional geometry defies low-dimensional intuition:

Distance concentration — the ratio $\max/\min$ of pairwise distances converges to 1; "far" and "near" points become relatively indistinguishable.
Volume concentration — almost the entire volume of a unit ball lives near its surface.
Space partitioning fails — axis-aligned structures like kd-trees (a classical binary-tree index that recursively splits space along coordinate axes) visit essentially every node once $d \gtrsim 20$ .

These facts force us to relax to approximate search — aiming for, say, 95% recall (the fraction of the true top-k that we actually return) rather than exact correctness. This is the premise of ANN.

2. Baseline: brute-force (Flat)

Always start with the baseline. In FAISS this is IndexFlatL2 / IndexFlatIP.

# Naive brute-force search (pseudocode)
def search(query: np.ndarray, xb: np.ndarray, k: int):
    dists = np.linalg.norm(xb - query, axis=1)  # O(Nd)
    idx = np.argpartition(dists, k)[:k]         # O(N)
    return idx[np.argsort(dists[idx])]

Complexity is $O(Nd)$ . With $N=10^6$ , $d=1024$ , float32, a single query streams 4 GB of memory (roughly the byte count of a full-HD movie — per query). With good SIMD (AVX-512 / NEON; CPU instructions that process many numbers in parallel in a single op) and BLAS-style GEMM (highly tuned matrix-multiply routines), brute force can still beat HNSW up to $N \approx 10^5$ . That is exactly why FAISS keeps IndexFlat as the gold-standard reference.

ANN reduces that $O(Nd)$ by either (a) visiting fewer vectors or (b) compressing each distance computation. IVF does (a); PQ does (b); HNSW does (a) differently; DiskANN combines both.

3. IVF: partition the space

TL;DR: don't look at everything — pre-sort the data into buckets of similar items, and at query time open only the few buckets closest to the query. Fast, but if the right answer happens to sit in a bucket you didn't open, you miss it.

3.1 Coarse quantization via k-means

IVF (inverted file index) adapts the classic text inverted index to vectors:

Analogy: finding a good ramen shop in Tokyo. You don't walk all 23 wards; you first pick "the 3 wards closest to my query location," then brute-force within them. IVF first coarsely picks a few cells (centroids closest to the query), then searches exhaustively inside them.

Train: run k-means on the dataset, producing $n_{\text{list}}$ centroids (the "average vectors" that act as a representative for each cluster) $\{\mathbf{c}_1, \dots, \mathbf{c}_{n_{\text{list}}}\}$ (typically $n_{\text{list}} \approx \sqrt{N}$ ).
Build: assign every vector to its nearest centroid; each centroid owns a posting list (the list of IDs of points that belong to that centroid — structurally identical to the list of page numbers filed under one entry in a book's back-of-book index).
Search: pick the top- $n_{\text{probe}}$ centroids closest to the query; scan only the corresponding posting lists.

3.2 Parameter design

nlist — larger means shorter posting lists and faster search, but you also need a larger nprobe to maintain recall. Rule of thumb: $4\sqrt{N} \le n_{\text{list}} \le 16\sqrt{N}$ .
nprobe — number of posting lists visited. nprobe=1 is very fast but low recall because the true nearest neighbour might sit in a cell you didn't open; 10-128 typically yields 90-99% recall.
Training data — FAISS's clustering defaults are min_points_per_centroid = 39 and max_points_per_centroid = 256; roughly 39-256 × nlist samples is the comfort zone (below it you get a warning, above it FAISS auto-subsamples).

3.3 Limits of IVF alone

Inside each cell, IVF is still brute force. Long posting lists become the bottleneck, which motivates combining IVF with Product Quantization (FAISS's IndexIVFPQ).

4. Product Quantization (PQ)

TL;DR: instead of comparing full vectors, compare compressed summary codes. Each vector is sliced into short chunks and each chunk is replaced by the nearest entry in a 256-entry palette, shrinking a kilobyte-sized vector to a few dozen bytes. Memory drops, distance math becomes table lookups, and you trade a bit of accuracy for one to two orders of magnitude in speed.

4.1 Break the vector into sub-codes

PQ^[1] (Jégou et al., 2011) compresses high-dimensional vectors into short integer codes to accelerate distance computation.

Split $\mathbf{x} \in \mathbb{R}^d$ into $m$ subvectors: $\mathbf{x} = [\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(m)}]$ , each $\in \mathbb{R}^{d/m}$ .
In each subspace run an independent k-means, producing $k^* = 256$ centroids $\{\mathbf{c}^{(i)}_1, \dots, \mathbf{c}^{(i)}_{256}\}$ — these are the codebook (think of it as a per-subspace 256-colour palette).
Replace each $\mathbf{x}^{(i)}$ with the 8-bit ID of its nearest centroid.

A $d$ -dim float32 vector ( $4d$ bytes) shrinks to $m$ bytes. For $d=1024$ , $m=64$ that's 4096 → 64 bytes — 64× compression.

4.2 Asymmetric Distance Computation (ADC)

The speed trick of PQ is Asymmetric Distance Computation: the query stays uncompressed, and we precompute a lookup table per subspace.

Split the query: $\mathbf{q} = [\mathbf{q}^{(1)}, \dots, \mathbf{q}^{(m)}]$ .
For each subspace $i$ , precompute distances to all 256 centroids: $\text{LUT}_i[j] = \|\mathbf{q}^{(i)} - \mathbf{c}^{(i)}_j\|^2, \quad j \in \{0, \dots, 255\}$
For a stored vector with code $(id_1, \dots, id_m)$ , the approximate distance is just $m$ table lookups summed: $d(\mathbf{q}, \mathbf{x})^2 \approx \sum_{i=1}^{m} \text{LUT}_i[id_i]$

Zero float multiplications per distance; with $m=64$ it's 64 8-bit adds. SIMD lookup (FAISS pq4_fast_scan) pushes this to millions of vectors per millisecond on a single core.

In short: PQ turns distance computation from "floating-point arithmetic" into "summing values from a precomputed table", and that substitution is the whole reason PQ is orders of magnitude faster.

Play with Product Quantization

Think of it as address compression. Instead of memorising '1-2-3 Jinnan, Shibuya, Tokyo' verbatim, you replace each segment with a dictionary code ('Tokyo→3', 'Shibuya→1', 'Jinnan→5'). This demo uses shrunk parameters (an 8-dim vector split into 4 sub-spaces, each with 4 centroids) so every moving part is visible — real PQ typically runs m=64, k*=256.

Query vector (8-dim = 32 B float32)

q[0]0.60

q[1]0.60

q[2]0.80

q[3]0.40

q[4]0.30

q[5]0.70

q[6]0.50

q[7]0.50

DB vector (PQ-compressed to 8 bits)

sub-space m=0code=3

sub-space m=1code=1

sub-space m=2code=2

sub-space m=3code=1

codes: [3, 1, 2, 1] · 8 bits · 32× compression

True distance (32 B compare)

0.0341

PQ approx distance (8 bit compare)

0.0350 (err 0.03)

Move the sliders. Watch the PQ approximation track the true distance, and notice when a sub-space switches to a different centroid.

A tiny worked example. Take $d=4$ , $m=2$ (subvector dimension 2), $k^*=4$ (2-bit IDs). Query $\mathbf{q} = [1, 0, 0, 1]$ splits into $\mathbf{q}^{(1)} = [1, 0]$ and $\mathbf{q}^{(2)} = [0, 1]$ . If subspace 1's codebook is $\{[1,0], [0,1], [-1,0], [0,-1]\}$ (IDs 0-3), then

\text{LUT}_1 = [\,0,\ 2,\ 4,\ 2\,]

(each entry is $\|\mathbf{q}^{(1)} - \mathbf{c}^{(1)}_j\|^2$ ). Similarly $\text{LUT}_2 = [2, 0, 2, 4]$ . A stored vector with PQ code $(id_1, id_2) = (0, 1)$ (so $\hat{\mathbf{x}} \approx [1, 0, 0, 1]$ ) has approximate squared distance $\text{LUT}_1[0] + \text{LUT}_2[1] = 0 + 0 = 0$ — two table lookups and one add, no multiplication. Real indices simply scale this up: 64 tables of 256 entries each, executed 16-32 codes wide in SIMD.

4.3 Error and tuning

PQ is lossy. Practical knobs:

m — number of sub-quantizers. Must divide $d$ ; $d/m$ in the range 4-16 balances accuracy and speed.
nbits — typically 8 (256 centroids). 16 bits (65k centroids) exists but codebook training gets expensive.
Optimised Product Quantization (OPQ)^[2] — apply an orthogonal rotation $R$ before splitting, balancing variance across subspaces. Recovers several percentage points of recall.

4.4 IVFPQ: the workhorse

FAISS's IndexIVFPQ combines both:

IVF narrows the query to a few cells.
Each cell stores PQ codes of the residual $\mathbf{x} - \mathbf{c}$ (residual PQ), which concentrates codebook capacity.

Geometric intuition for the residual. A raw vector $\mathbf{x}$ can live anywhere in the big ambient space, but once you subtract its cell centroid $\mathbf{c}$ , the residual $\mathbf{x} - \mathbf{c}$ only describes the offset inside that cell — a much smaller, more concentrated distribution. The same 256 PQ centroids can now resolve finer detail, so quantisation error drops. The more cells you have ( $n_{\text{list}}$ ), the tighter the residual distribution becomes.

IndexIVFPQ scales to $N=10^9$ (≈ a billion vectors — roughly 100 × the number of articles in English Wikipedia) on a single machine, with per-query latency ranging from a few ms to a few tens of ms depending on hardware, nprobe, and recall target (see the FAISS wiki Indexing 1G vectors page). Recall tends to be lower than HNSW, but memory usage and GPU amenability win in many production settings (FAISS-GPU, cost-constrained, disk-backed).

What goes wrong in production. If your training data doesn't match the serving distribution, you can hit codebook collapse — most vectors get assigned to a handful of centroids and the rest sit empty, tanking recall. Stay above FAISS's min_points_per_centroid warning, and train on samples drawn from the same distribution you'll serve (e.g. if production is multilingual BGE, match the language mix at training time). IVF has a cousin problem: if data is topic-skewed, cell sizes get lopsided, and a few long posting lists dominate p99 latency even at modest nprobe. Track max/min cell size as a metric so you notice before users do.

5. HNSW: hierarchical navigable small world

TL;DR: pre-build a “social network of vectors” in which similar vectors link to each other, then at query time walk the graph towards the query, zooming from world map down to street map as you go. Today’s default for in-memory vector search — but only works well while everything fits in RAM.

5.1 Inspiration

HNSW^[3] (Malkov & Yashunin, 2016) is the most widely deployed graph-based ANN today. It combines two ideas:

Navigable Small World (NSW) — a graph with both short and long-range edges supports greedy routing (at every step, move to whichever neighbour of the current node is closest to the query — a purely local rule) to any node in logarithmic hops (Kleinberg's small-world model). Kleinberg's key result is that when long-range edges are drawn with probability $\propto 1/r^d$ in the node-distance $r$ — and only in that case — a purely local greedy algorithm reaches its target in $O(\log^2 N)$ hops. HNSW's layered graph is, implicitly, an approximation of exactly this edge distribution.
Skip lists — a randomised layered structure lets search start coarse and refine.

5.2 Structure

HNSW is a layered graph:

Every point lives in Layer 0.
Each point's top layer is drawn from an exponential distribution $\ell = \lfloor -\ln(U(0,1)) \cdot m_L \rfloor$ (default $m_L = 1/\ln(M)$ ).
Each node has at most $M$ edges per layer ( $2M$ at Layer 0).

5.3 Search algorithm

HNSW: greedy descent through 3 layers

Think of zoom levels on a map. First navigate on the world map (L2), then the country map (L1), then the street map (L0).

Layer L2 · 2 nodes

Layer L1 · 5 nodes

Layer L0 · 12 nodes

Enter at entry point b on the topmost layer L2. Greedily move to the neighbour closest to the query.

1 / 8

In pseudocode:

search(q, k):
    ep = entry_point
    for level L in [top_level ... 1]:
        ep = greedy_search_layer(q, ep, ef=1, level=L)
    W = greedy_search_layer(q, ep, ef=efSearch, level=0)
    return top-k of W
 
greedy_search_layer(q, ep, ef, level):
    visited = {ep}
    candidates = min-heap([(dist(q, ep), ep)])   # priority: closest first
    W         = max-heap([(dist(q, ep), ep)])    # keep top ef
    while candidates not empty:
        c = candidates.pop_min()
        f = W.peek_max()
        if dist(q, c) > dist(q, f): break         # pruning
        for each neighbor n of c at `level`:
            if n in visited: continue
            visited.add(n)
            if dist(q, n) < dist(q, f) or |W| < ef:
                candidates.push((dist(q, n), n))
                W.push((dist(q, n), n))
                if |W| > ef: W.pop_max()
    return W

Key points:

efSearch — size of the beam. Larger gives higher recall, lower QPS. 50-200 in production, 300+ when chasing 99% recall.
Upper layers use ef=1 — a plain greedy descent.
Layer 0 uses ef=efSearch — beam search (keep several candidates alive at once and expand them in parallel — a wider, breadth-first style of walk) to broaden candidates.

In short: go "roughly in the right direction" in the upper layers, then "walk several candidates forward in parallel" in the bottom layer to recover precision.

5.4 Insertion and neighbor selection

insert(x):
    ℓ = floor(-ln(rand()) * m_L)
    ep = entry_point
    for L in [top_level ... ℓ+1]:
        ep = greedy_search_layer(x, ep, ef=1, L)
    for L in [min(top_level, ℓ) ... 0]:
        W = search_layer(x, ep, ef=efConstruction, L)
        neighbors = select_neighbors_heuristic(x, W, M)
        add bidirectional edges between x and neighbors
        for each n in neighbors:
            if degree(n) > M_max:
                W' = neighbors(n)
                neighbors(n) = select_neighbors_heuristic(n, W', M_max)
        ep = W
    if ℓ > top_level: top_level = ℓ; entry_point = x

Neighbor selection decides HNSW's quality. Rather than " $M$ nearest neighbours", HNSW uses a diversity heuristic:

select_neighbors_heuristic(x, W, M):
    R = []
    sort W by dist(x, ·) ascending
    for e in W:
        if |R| == M: break
        if ∀ r ∈ R: dist(x, e) < dist(r, e):
            R.append(e)
    return R

Keep $e$ only if $x$ is closer to $e$ than any already-selected neighbour is. This spreads neighbours across directions, preventing the graph from collapsing onto a single cluster — a key reason HNSW avoids local minima during search. faiss/impl/HNSW.cpp's shrink_neighbor_list implements this.

In short: instead of "take the $M$ nearest", HNSW picks "the $M$ near-ish neighbours that aren't redundant with each other", so edges fan out in all directions and greedy search rarely gets stuck in a local pocket.

The paper's full Algorithm 4 takes two additional optional flags: extendCandidates (grow $W$ by the neighbours of each candidate before selecting — useful on sparse upper layers) and keepPrunedConnections (if $|R| < M$ at the end, refill from the pruned set in distance order). Many implementations including FAISS default both to false; the paper recommends extendCandidates=true only on upper layers.

5.5 Parameters and cost

M: 12-48 typically. Higher gives better recall at the cost of memory ( $M \cdot 4$ bytes of links per point).
efConstruction: 100-500; build-time only; higher improves index quality but is linear in build time.
Complexity: search expected $O(\log N)$ , build $O(N \log N)$ .
Memory: $N$ vectors + roughly $N \cdot M_0 \cdot 4$ bytes of links (most nodes live only in Layer 0, so upper-layer links are negligible). $M=16$ ( $M_0 = 2M = 32$ ), $d=1024$ , $N=10^6$ : 4 GB vectors + ~128 MB links (the link overhead is only ~3% on top of the raw vectors).

Downsides: memory-hungry and hard to delete from. Actually removing a node and re-wiring its edges can break graph connectivity, so most implementations mark the node with a tombstone (a "deleted" flag that leaves the node in the graph but hides it from query results). Over time the tombstoned nodes clutter the graph: search paths still route through them, and regions of the graph can become effectively unreachable, so recall silently degrades even at high efSearch. A common operational rule is to trigger a full rebuild (compaction — periodically rewriting the index so tombstoned nodes are physically removed) once the tombstone ratio crosses 5-10%.

6. DiskANN / Vamana: a graph index for SSDs

TL;DR: same “walk a friend graph” idea as HNSW, but the graph lives on SSD and only a PQ-compressed thumbnail sits in RAM. Unlocks billion-vector single-node indexes at the cost of SSD IOPS — throughput under high concurrency becomes the new bottleneck.

6.1 Motivation

HNSW assumes every vector fits in RAM. A billion 1024-dim float32 vectors is 4 TB (several times the 512 GB–1 TB DRAM of a typical server — clearly doesn't fit on a single machine). DiskANN^[4] (Microsoft, 2019) introduces a single-layer graph index designed to run from SSD.

6.2 The Vamana graph

Like HNSW, search is greedy, but there is no layered hierarchy — a single layer mixes short and long edges. The entry point is the medoid of the dataset, computed once at build time. Unlike the centroid (an abstract arithmetic mean that may not correspond to any real vector), the medoid is the actual data point closest to the centroid — i.e. the data point that sits most "centrally" in the distribution. Starting greedy search from there minimises the expected number of hops to any query target, since on average no point in the dataset is too far from the medoid. Construction uses $\alpha$ -pruning:

build_vamana(X, R, α):
    G = random graph (R edges per node)
    for two passes (α=1.0, then α=1.2):
        for p in random permutation of X:
            V = greedy_search(G, p, ep, L)   # candidate set
            neighbors(p) = robust_prune(p, V, α, R)
            for n in neighbors(p):
                add edge n→p
                if degree(n) > R:
                    neighbors(n) = robust_prune(n, N(n)∪{p}, α, R)

robust_prune ( $\alpha$ -pruning):

robust_prune(p, V, α, R):
    V = V sorted by dist(p, ·)
    P = []
    while V not empty and |P| < R:
        p* = V.pop_closest()
        P.append(p*)
        V = {v ∈ V : α · dist(p*, v) > dist(p, v)}   # drop v dominated by p*, keep diverse (long-range) edges
    return P

With $\alpha=1$ on the first pass and $\alpha > 1$ (typically 1.2) on the second, Vamana keeps more long-range edges than HNSW's diversity heuristic. That is what makes greedy search work well on SSD.

In short: raising $\alpha$ above 1 discards "nearby, similar" neighbours in favour of "farther-away, differently-oriented" ones, so a single hop can jump further across the graph. On SSD, one hop = one random read, so fewer hops translate directly to lower latency.

6.3 SSD layout

Vectors and adjacency lists are co-located in 4 KB pages (Vamana page layout).
Visiting a node on the search path costs one 4 KB random read that returns both the vector and its neighbours.
To keep recall, DiskANN keeps a PQ-compressed copy in memory for coarse distance pruning, and only reads the full-precision vector for a handful of candidates.

6.4 IOPS and recall

On a single NVMe SSD, DiskANN typically reaches 50-150 random 4 KB reads per query (the paper's SIFT1B benchmark reports roughly 100 reads at recall@1 = 95%), p99 5-10 ms — at scales (10B+ vectors) far beyond what HNSW can hold in RAM for the same budget. FreshDiskANN (2021) adds background insert/delete with periodic re-merges.

What goes wrong in production. Those numbers are for low-concurrency workloads. Consumer SSDs top out around 50-200 k random 4 KB read IOPS (I/O operations per second); server NVMe reaches 500 k-1 M. As concurrent QPS × (reads per query) approaches that ceiling, the device queue saturates and tail latency blows up non-linearly — p99 (the worst 1% of queries) can jump from 10 ms to hundreds of ms even while the median barely moves. Standard mitigations: stripe across multiple SSDs (RAID-0), crank up PQ compression so more pruning happens in memory before hitting disk, or lower beam_width to visit fewer nodes per query.

7. ScaNN: Google's anisotropic quantization

Before moving on, it is worth circling back to §4's PQ to look at one important refinement. Google's ScaNN^[5] (2020) targets maximum inner product search via a more careful kind of quantisation (compressing each vector down to a short sequence of bits).

Regular PQ minimises reconstruction error $\|\mathbf{x} - \hat{\mathbf{x}}\|^2$ . But for IP search, what matters is the error in $\mathbf{q} \cdot \mathbf{x}$ , which is dominated by the component parallel to $\mathbf{q}$ . ScaNN's anisotropic quantization loss weights the two components differently:

L(\mathbf{x}, \hat{\mathbf{x}}) = h_\parallel(\mathbf{x}) \cdot \|(\mathbf{x} - \hat{\mathbf{x}})_\parallel\|^2 + h_\perp(\mathbf{x}) \cdot \|(\mathbf{x} - \hat{\mathbf{x}})_\perp\|^2

with $h_\parallel \gg h_\perp$ . This learns a quantizer that minimises the error direction that matters for IP. Empirically, ScaNN lifts recall by 5-10% at the same bit budget.

8. Filtered search (pre / post / in-filter)

Real systems need metadata-filtered ANN: "only published docs", "only this tenant".

Post-filter: retrieve $k' \gg k$ via ANN, apply the filter afterwards. Breaks when selectivity (the fraction of rows that pass the filter) is low.
Pre-filter: materialise the filter-matching ID set, brute-force within it. Good for very selective filters, loses ANN speed.
In-filter / integrated: keep only filter-matching nodes as candidates during graph traversal. Each system handles reachability differently. Current Milvus docs describe two modes: standard filtering (scalar pre-filter, then ANN within the matching set) and iterative filtering (vector-search iterator with per-hit scalar checks); the query planner picks between them. Qdrant uses a filterable HNSW index that adds extra links into the graph based on payload categories so the graph stays connected under filtering (and falls back to the payload index alone when cardinality is too low).

Recent research — ACORN (2024), Filtered-DiskANN (2023) — builds graphs that are filter-aware from the start.

When does each strategy win?

Strategy	Works well when	Fails when	Typical home
Post-filter	selectivity ≥ 50% (most rows pass)	selectivity ≤ 1%; top- $k$ ends up empty	pgvector, naive Elasticsearch
Pre-filter	selectivity ≤ 1% (subset is tiny)	mid-range selectivity — ANN is wasted and the linear scan dominates	Milvus standard filtering
In-filter	the messy middle, selectivity 1–50%	harder to implement; graph connectivity must be engineered in	Qdrant filterable HNSW, Filtered-DiskANN

9. Implementations compared

A cheat sheet for how today’s popular OSS and SaaS products combine the algorithms from §2–§8. A rough mental model: FAISS is a library, Milvus / Qdrant / Weaviate are full distributed-DB products, pgvector is a Postgres extension, and Pinecone is a fully managed SaaS.

Library	Core algorithms	Notes
FAISS	Flat, IVF, IVFPQ, HNSW, OPQ	Meta. GPU + SIMD. The research reference.
Milvus	Wraps FAISS / HNSW / DiskANN	Distributed, persistent, multi-tenant. Kubernetes-first.
Qdrant	HNSW (custom Rust)	Strong payload filtering, gRPC/REST.
Weaviate	HNSW	Module-based vectorisers, GraphQL API.
pgvector	IVFFlat, HNSW	PostgreSQL extension. Joins with existing SQL.
Pinecone	Proprietary hybrid	Fully managed SaaS.
LanceDB	IVFPQ, HNSW over Arrow/Parquet	Columnar persistence — think "DuckDB for embeddings".
Elasticsearch / OpenSearch	Lucene HNSW	BM25 + vector hybrid in one engine.

9.1 A selection flowchart

10. Hybrid search: BM25 + vector

In practice, combining lexical (BM25; a classical relevance score from the TF-IDF family that weights how many of the query's words appear in each document) and semantic (vector) beats either alone.

10.1 Reciprocal Rank Fusion (RRF)

\text{score}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}

$R$ is the set of retrievers (BM25, vector, …), $k \approx 60$ by convention. No score normalisation required, which is why Elasticsearch and Weaviate use RRF as a default hybrid strategy.

10.2 Cross-encoder re-rank

Retrieve 100 with ANN, re-score each with a BERT-style cross-encoder. This is essentially the standard RAG pipeline.

Bi-encoder vs cross-encoder. Vector search itself uses a bi-encoder: query and document are embedded independently, and similarity is a cheap cosine / dot product between the two vectors. Because document embeddings can be precomputed offline, a bi-encoder scales to billions of docs. A cross-encoder instead concatenates the (query, document) pair and runs them through one Transformer so every attention head can see both sides at once — this captures interactions a bi-encoder cannot, giving noticeably better relevance, but you pay a full forward pass per (query, doc) pair at query time. The two-stage pattern — cheap bi-encoder to get 100 candidates, expensive cross-encoder to re-rank only those — is the cost/quality sweet spot.

11. Production metrics

Metric	Meaning
Recall@k	overlap between returned top-k and true top-k
QPS	queries per second
p50 / p95 / p99 latency	median / top-5% / top-1% latency; the tail (especially p99) is where GC pauses and disk IO stalls show up
Build time	batch reindex time
Index size	shows compression efficiency
Insert/delete throughput	whether streaming updates are viable

ann-benchmarks is the standard framework for comparing HNSW / IVFPQ / ScaNN etc under identical conditions. The canonical visualisation is a recall vs QPS Pareto frontier plot. The Pareto frontier is the set of operating points such that you cannot improve recall without sacrificing QPS, or vice versa — an algorithm whose frontier sits further out dominates the others at every trade-off. The point of plotting frontiers (rather than a single number) is that ANN algorithms rarely win outright: method A may beat B at 95% recall yet lose at 99%, and only the frontier shows that crossover clearly.

12. State of the art (as of 2026)

What’s likely to become the next default, as of 2026. The running themes are bigger scale, tighter compression, and easier updates.

SPANN / SPFresh (Microsoft, 2021-2023) — IVF trees kept on SSD with clustered posting lists for easier updates; a DiskANN successor line.
Learned indexes — an active research direction where neural nets adapt centroid selection or hierarchical structure to the data distribution (not yet upstreamed into FAISS).
RaBitQ — a randomised binary quantisation with tighter error bounds than PQ (Gao & Long, 2024). FAISS has landed faiss::IndexRaBitQ / IndexIVFRaBitQ; pgvector 0.8 is evaluating it.
GPU HNSW (CAGRA, 2023) — NVIDIA RAPIDS reformulates HNSW construction and search as CUDA-friendly graph ops.
Hybrid DRAM / CXL layouts — after Optane was discontinued, the new direction is NVMe + big RAM + CXL memory pools for very large working sets.

13. Putting it all together

The design space of vector search can be summarised in three orthogonal axes:

Axis	Technique
Reduce candidate set	IVF / Graph (HNSW, Vamana)
Compress distance	PQ / OPQ / ScaNN / RaBitQ
Exploit the memory hierarchy	DiskANN / SPANN (SSD), CAGRA (GPU)

In practice, four questions drive the choice: N, memory budget, update rate, recall target.

$N \le 10^5$ → Flat
$10^5 \le N \le 10^7$ , RAM plentiful → HNSW
$10^7 \le N$ , RAM constrained → IVFPQ
$N \ge 10^9$ → DiskANN / SPANN
Already on PostgreSQL → pgvector (HNSW)
Need distributed → Milvus / Qdrant / Weaviate

Vector DBs aren't magic. Once you see them as classical inverted indexes extended to high-dimensional space, the whole architecture becomes legible — and your ability to debug recall regressions, size capacity, and triage incidents changes entirely.

References

H. Jégou, M. Douze, C. Schmid, "Product quantization for nearest neighbor search", IEEE TPAMI, 2011.
T. Ge et al., "Optimized Product Quantization", IEEE TPAMI, 2014.
Y. Malkov, D. Yashunin, "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs", 2016. arXiv:1603.09320
S. Jayaram Subramanya et al., "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node", NeurIPS, 2019.
R. Guo et al., "Accelerating Large-Scale Inference with Anisotropic Vector Quantization", ICML, 2020.
FAISS wiki — the de-facto reference.
ann-benchmarks — recall vs throughput comparisons.