2025-05-18
This article can be seen as an extension of the "Data Structures and Algorithms" series, introducing new data structures based on previously published related blog posts. The "total order dictionary" problem and the approximate K nearest neighbor problem for vectors can be solved with the same concise and powerful idea.
We want to implement the following interface with a data structure, which we call a "total order dictionary":
Store key-value pairs, look up values by keys, and the user guarantees that the keys are totally ordered. The expected time for lookup, addition, and deletion is O(log n). You can find the largest or smallest key in O(log n) time, and find the "successor" or "predecessor" of a key in O(log n) time. Subscript access is not supported.
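To make the interface concrete, here is a minimal sketch of it as an abstract Python class. The method names (`insert`, `successor`, and so on) are my own choice for illustration and are not taken from the lecture notes:

```python
from abc import ABC, abstractmethod
from typing import Generic, Optional, TypeVar

K = TypeVar("K")  # keys must be totally ordered (support <)
V = TypeVar("V")

class TotalOrderDict(ABC, Generic[K, V]):
    """The "total order dictionary" interface; every method is expected O(log n)."""

    @abstractmethod
    def get(self, key: K) -> Optional[V]: ...          # look up the value for a key

    @abstractmethod
    def insert(self, key: K, value: V) -> None: ...    # add or overwrite a key-value pair

    @abstractmethod
    def delete(self, key: K) -> None: ...               # remove a key

    @abstractmethod
    def min_key(self) -> Optional[K]: ...                # smallest key

    @abstractmethod
    def max_key(self) -> Optional[K]: ...                # largest key

    @abstractmethod
    def successor(self, key: K) -> Optional[K]: ...      # smallest key strictly greater than `key`

    @abstractmethod
    def predecessor(self, key: K) -> Optional[K]: ...    # largest key strictly smaller than `key`

    # Note: no __getitem__(index) — subscript access is not part of the interface.
```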
Note the difference between this interface and an "ordered vector," described below, which can be naively implemented with a vector and binary search algorithm.
The user guarantees that elements must be totally ordered. O(n) to insert elements, O(n) to delete elements, O(log n) to check if an element exists. O(1) to access the k-th element by subscript.
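For contrast, a minimal sketch of this "ordered vector" using Python's `bisect` module; lookups are binary searches, but insertion and deletion shift elements, hence O(n):

```python
import bisect

class OrderedVector:
    """Sorted list + binary search: O(log n) lookup, O(n) insert/delete, O(1) by index."""

    def __init__(self):
        self._data = []

    def insert(self, x):
        bisect.insort(self._data, x)           # binary search for the position, then O(n) shift

    def delete(self, x):
        i = bisect.bisect_left(self._data, x)
        if i < len(self._data) and self._data[i] == x:
            self._data.pop(i)                  # O(n) shift to close the gap

    def contains(self, x):
        i = bisect.bisect_left(self._data, x)  # O(log n) binary search
        return i < len(self._data) and self._data[i] == x

    def kth(self, k):
        return self._data[k]                   # O(1) subscript access
```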
To implement a "total order dictionary," the most common idea is a balanced binary search tree (such as an AVL tree). However, besides them, there is a rather concise randomized algorithm - Skip List.
Note: The content of this section and several sections below is organized from data structure lecture notes, which can be obtained from the "Lecture Archive" on this page.
The skip list has a clear two-dimensional structure: viewed horizontally, the skip list consists of multiple levels, each level being a linked list; viewed vertically, the skip list is a series of towers of varying heights, each tower representing an element. Using a Wikipedia illustration:
The multi-level horizontal structure allows us to quickly look up elements by key: we start from the top level and jump down level by level. In each level, we continuously follow the linked list to the right until the element on the right is greater than the key we are looking for; at this point, we descend one level.
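A minimal sketch of this search, assuming each tower is stored as a node with an array `forward` of right-pointers, one per level (a common representation; details vary between implementations):

```python
class SkipNode:
    def __init__(self, key, value, height):
        self.key, self.value = key, value
        self.forward = [None] * height    # forward[i]: next node at level i (0 = bottom)

class SkipList:
    def __init__(self, max_height=32):
        self.head = SkipNode(None, None, max_height)  # sentinel tower of maximal height
        self.height = 1                               # current number of non-empty levels

    def search(self, key):
        node = self.head
        # Start at the top level; at each level move right while the next key is smaller,
        # then drop down one level.
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
        node = node.forward[0]                        # candidate at the bottom level
        if node is not None and node.key == key:
            return node.value
        return None
```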
The height of each tower is sampled from a geometric distribution when the element is inserted: starting from the bottom level, we repeatedly flip a coin to decide whether to 1) grow one level up, with probability p (e.g. p = 1/2), or 2) stop growing entirely. The expected height is therefore 1/(1-p), so the average storage overhead from creating nodes on multiple levels is also this value, which is well within a controllable range (2 nodes per element for p = 1/2).
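The height sampling itself is only a few lines; here p is the probability of growing one more level, as assumed above:

```python
import random

def random_height(p=0.5, max_height=32):
    """Sample a tower height: start at 1 and keep growing with probability p."""
    h = 1
    while random.random() < p and h < max_height:
        h += 1
    return h
```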
According to the description of the query process above, the complexity of the query depends on the number of levels in the skip list and the number of jumps per level.
Assuming there are n elements in total, the probability that any tower reaches level h is at most $n \cdot p^h$ (a union bound over all n towers). Therefore, with arbitrarily high probability, the highest level does not exceed O(log n). This bounds the number of levels in the skip list.
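Spelled out, with p being the growth probability as above:

$$
\Pr[\exists\ \text{tower of height} \ge h] \;\le\; n\,p^{h},
\qquad
h = c\log_{1/p} n \;\Longrightarrow\; n\,p^{h} = n^{1-c},
$$

so taking, say, $c = 2$ means the highest level exceeds $2\log_{1/p} n$ with probability at most $1/n$.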
Horizontal jumps at level k can only land on towers of height exactly k; if we landed on a taller tower, we would already have reached that position during the horizontal jumps at the level above. So, scanning to the right, how many towers do we pass before hitting the "wall" of a taller tower? Each tower encountered is such a wall with probability $P(h>k \mid h\geq k) = p$, so the number of jumps per level also follows a geometric distribution, with an expectation of 1/p.
Multiplying the two, the expected query cost is on the order of $\log_{1/p}(n) / p$, i.e. O(log n) for any fixed p.
To insert or delete, run the query algorithm with the element's key and operate at the position it finds. For the tower being inserted or removed, the only extra work is re-splicing the linked-list pointers at each level the tower touches.
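Extending the `SkipList` sketch above (again, a sketch rather than the lecture notes' reference code), insertion reuses the same downward walk but remembers, at every level, the last node visited before descending; those are exactly the nodes whose pointers must be re-spliced:

```python
class SkipList(SkipList):  # reopen the earlier sketch to add insertion
    def insert(self, key, value, p=0.5):
        # Walk down exactly as in search(), but record, at each level, the last node
        # visited before dropping down.
        update = [self.head] * len(self.head.forward)
        node = self.head
        for level in range(self.height - 1, -1, -1):
            while node.forward[level] is not None and node.forward[level].key < key:
                node = node.forward[level]
            update[level] = node

        target = node.forward[0]
        if target is not None and target.key == key:
            target.value = value                           # key already present: overwrite
            return

        h = random_height(p, len(self.head.forward))       # sample the new tower's height
        self.height = max(self.height, h)
        new_node = SkipNode(key, value, h)
        for level in range(h):                             # splice the new tower into each level
            new_node.forward[level] = update[level].forward[level]
            update[level].forward[level] = new_node
```

Deletion is symmetric: locate the tower with the same downward walk, unlink it from `update[level].forward` at each of its levels, and lower `self.height` if the top levels become empty.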
In information retrieval for text, images, etc., we often represent queries and candidate documents as vectors and select the documents closest to the query to return. The problem is how to find the k candidate vectors closest to the query vector. This is called the K-NNS (k-Nearest Neighbor Search) problem. If approximation and error are allowed, it is called the K-ANNS (k-Approximate Nearest Neighbor Search) problem.
We want to find the existing vectors that are neighbors to a new vector given by the user. Generally, an index structure is maintained to accelerate the search, and this structure can dynamically add and delete existing vectors.
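As the baseline to beat, exact k-NN by brute force is a couple of lines of NumPy per query, but it costs O(nd) per query, which is exactly what the index structures below try to avoid:

```python
import numpy as np

def knn_brute_force(data, query, k):
    """data: (n, d) array of stored vectors; query: (d,) vector. Returns indices of the k nearest."""
    dists = np.linalg.norm(data - query, axis=1)  # n Euclidean distances
    return np.argsort(dists)[:k]
```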
One idea is to convert the "neighbor" relationship into "order" on each dimension. Then, establish a "cube" near the target vector and apply range query algorithms such as:
Range tree. Recall that the idea of a range tree is "trees nested in trees": in each internal node, the range tree on the i-th dimension stores a range tree over the (i+1)-th dimension, until the last dimension, where it degenerates into a plain binary search tree. Its construction time/space complexity is $O(n (\log n)^{d-1})$, and since our vector representations are usually high-dimensional, we run into the curse of dimensionality just building the range tree, making it infeasible.
kd tree: Recall that the idea of a kd tree is to alternate dimensions for binary partitioning. Its space complexity is O(n), so it can be built even in high dimensions. Its range query complexity is $O(k + n^{1-1/d})$. Although for large d this is close to O(n), essentially the same as brute-force scanning in the worst case, could it still be faster in practice because the "cube" is very small?
For kd trees, however, directly issuing "cube queries" is not the best fit for the nearest neighbor problem. A DFS over the kd tree, guided by the distance to the target vector, suits the problem better and is simple and straightforward.
Refer to this lecture note for the specific algorithm:
Start from the root of the tree
Keep descending, each time choosing the child node corresponding to the space where the target vector is located
Upon reaching a leaf node, initialize the candidate element heap with the elements in the leaf node, with the farthest candidate element at the top of the heap
Keep backtracking. For each node, try to add the node's own "boundary" element (the element that defines its split) to the candidate heap. Then compute the distance from the target vector to the node's splitting plane: if this distance is less than the distance at the top of the heap, also run DFS on the other, unvisited subtree; otherwise, continue backtracking (a code sketch of this procedure follows the list).
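Here is a compact sketch of this search, in the common variant where every node (not only the leaves) stores one point; the candidate heap keeps the worst of the current K candidates on top, as described above:

```python
import heapq

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis    # splitting point and splitting dimension
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])              # alternate dimensions level by level
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return KDNode(points[mid], axis,
                  build_kdtree(points[:mid], depth + 1),
                  build_kdtree(points[mid + 1:], depth + 1))

def kdtree_knn(root, query, k):
    heap = []  # max-heap via negated squared distances: the farthest candidate sits on top

    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, query))

    def visit(node):
        if node is None:
            return
        # 1) Descend into the subtree on the query's side of the splitting plane.
        delta = query[node.axis] - node.point[node.axis]
        near, far = (node.left, node.right) if delta <= 0 else (node.right, node.left)
        visit(near)
        # 2) While backtracking, offer the node's own point as a candidate.
        d = sq_dist(node.point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, node.point))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, node.point))
        # 3) Cross the splitting plane only if it is closer than the current worst candidate.
        if len(heap) < k or delta * delta < -heap[0][0]:
            visit(far)

    visit(root)
    return [p for _, p in sorted(heap, key=lambda t: -t[0])]
```

With k = 1 this reproduces the walkthrough below.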
You can use the following diagram to help understand this algorithm, image from pyimagesearch.com. Taking K=1 as an example:
In steps 2 and 3, we reach the leaf node of region A and take the only vector there as the current best
In step 4, we find that the distance from Q to the A-B boundary is less than the current best distance, so we also descend into subtree B, which updates the current best to (0.42, 0.65)
Backtracking again, this time the distance from Q to the (AB)-(DC) boundary is greater than the current best distance, ending the search.
It is not difficult to see that this algorithm can find the exact K nearest neighbors. However, its query complexity cannot be guaranteed, and we can only hope that its actual running speed is relatively fast.
If we do not need the exact K nearest neighbors and can tolerate some error, we can trade accuracy for lower complexity. Just as the "order" relation naturally leads to a linear structure, the "neighbor" relation naturally leads to a network (a graph, in graph-theoretic terms).
Continuing the analogy, in a linear structure, searching for elements "greater than" or "less than" a certain value can be done by starting from an entry point and continuously moving to predecessor or successor nodes. Similarly, in a network, searching for elements "near" a certain value can be done by starting from an entry point and "navigating" continuously towards the neighbors of the current node, for example, "greedily" choosing to move towards the neighbor closest to the target. This is known as greedy routing.
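A minimal sketch of greedy routing on such a graph; the names here are illustrative assumptions: `graph` maps each node id to its neighbor ids, `vectors` maps ids to vectors, and `dist` is the distance function:

```python
def greedy_route(graph, vectors, query, entry, dist):
    """Walk from `entry` towards `query`, always moving to the closest neighbor;
    stop when no neighbor is closer than the current node (a local minimum)."""
    current = entry
    current_d = dist(vectors[current], query)
    while True:
        best, best_d = current, current_d
        for u in graph[current]:               # inspect all neighbors of the current node
            d = dist(vectors[u], query)
            if d < best_d:
                best, best_d = u, d
        if best == current:                    # no neighbor is closer: stop here
            return current
        current, current_d = best, best_d
```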
To achieve fast searching, some form of explicit or implicit hierarchical structure is almost indispensable. Researchers initially exploited an implicit one: small world networks, which contain both short-range and long-range connections and are, to some extent, equivalent to a hierarchy: during navigation, greedy choices let you first ride long-range connections to get near the destination quickly, and then use short-range connections to narrow in precisely. Here, "short-range" and "long-range" are determined naturally by the norm of the vector space.
The algorithm for building small world networks matches our "information retrieval" scenario well: it requires no prior information and works online. Each time a vector is added, we simply connect it to its M nearest neighbors found by searching the current graph. As vectors keep being added, the connections formed earlier are more likely to end up serving as long-range connections (there were fewer candidate nodes at the time, so the selected neighbors tend to be farther away). (Reference paper)
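Along those lines, a hedged sketch of the online construction, reusing `greedy_route` from above; restarting the greedy walk from a few random entries stands in for the paper's actual search procedure:

```python
import random

def nsw_insert(graph, vectors, new_id, new_vec, M, dist, n_entries=5):
    """Insert new_vec into a mutable NSW graph by linking it to ~M near neighbors."""
    if graph:
        # Search the existing graph from several random entry points and pool the results.
        candidates = {greedy_route(graph, vectors, new_vec, random.choice(list(graph)), dist)
                      for _ in range(n_entries)}
        neighbors = sorted(candidates, key=lambda u: dist(vectors[u], new_vec))[:M]
    else:
        neighbors = []
    vectors[new_id] = new_vec
    graph[new_id] = set(neighbors)
    for u in neighbors:                       # connections are bidirectional
        graph[u].add(new_id)
```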
For the K-ANNS problem, the widely adopted solution today adds an explicit hierarchy on top of the small world network above. Each layer is a small world network, with fewer nodes in higher layers; each vector appears as a node in every layer from the bottom up to its own height, which varies from vector to vector and follows a geometric distribution, identical to the vertical structure of the skip list! This method is called Hierarchical Navigable Small World, abbreviated as HNSW. Using the illustration from the original paper:
So, the query algorithm and the insertion/deletion algorithm can somewhat mimic skip lists.
Starting from the entry point at the highest level, descend layer by layer, using greedy routing in each layer. Actually, describing the HNSW query process as "navigation" is not quite accurate; the more precise term is "best first search," which has been detailed in previous blog posts about graph algorithms. The latter is unlike navigation because it does not maintain a single "current position":
In the Navigation/Routing process, "you" are at a specific position at any given time and can only move to adjacent positions.
It is also unlike DFS (whether recursive or simulated iteratively). For example, in the simulated form (often called backtracking), as in chess puzzles, there is a board state at every step, and each step may move only one piece.
Best first search, as a type of priority search, relies on a priority queue to maintain the nodes to be visited. It may happen that "in one round you visit here, and in the next round, you visit somewhere far away."
The specific algorithm is a variant of best-first search, mentioned in previous blog posts, with pseudocode copied below:
```
let W be a candidate pool of max size K, organized as a heap with the worst candidate on top
initialize both the priority queue and W with the entry point(s)
while the priority queue is not empty:
    pop node v with top priority
    mark v as visited
    if v is worse than every candidate in W:
        break
    for each neighbor u of v that has not been visited:
        mark u as visited
        if W is not full, or u is better than the worst candidate in W:
            push u to W, evicting the worst candidate if W now exceeds size K
            push u to priority queue
```
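For concreteness, a direct Python transcription of this pseudocode (a sketch; `graph` is the adjacency dict of one layer, `vectors` maps node ids to vectors, and `dist` is the distance function):

```python
import heapq

def search_layer(graph, vectors, query, entry_points, K, dist):
    """Best-first search within one layer; returns up to K candidates, best first."""
    visited = set(entry_points)
    # Min-heap of nodes to expand, keyed by distance to the query (best on top).
    candidates = [(dist(vectors[v], query), v) for v in entry_points]
    heapq.heapify(candidates)
    # Candidate pool W as a max-heap via negated distances: the worst kept candidate on top.
    W = [(-d, v) for d, v in candidates]
    heapq.heapify(W)
    while len(W) > K:
        heapq.heappop(W)
    while candidates:
        d, v = heapq.heappop(candidates)
        if d > -W[0][0]:                      # v is worse than every candidate in W
            break
        for u in graph[v]:
            if u in visited:
                continue
            visited.add(u)
            du = dist(vectors[u], query)
            if len(W) < K or du < -W[0][0]:   # W not full, or u beats the worst in W
                heapq.heappush(W, (-du, u))
                if len(W) > K:
                    heapq.heappop(W)          # evict the worst
                heapq.heappush(candidates, (du, u))
    return [v for _, v in sorted(W, key=lambda t: -t[0])]
```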
Each layer's best-first search can accept multiple entry points and can output multiple (here K) candidate vectors, so we can feed the candidates output by one layer in as the entry points of the next layer down. In the actual algorithm, the layer-by-layer descent uses K=1, and only the bottom layer uses K>1 (no smaller than the k in the k-NN problem definition), from whose output the k nearest are returned.
Note: Due to the reasons described earlier, the case of K=1 cannot be implemented as "only looking at immediate neighbors at each step and stopping once all neighbors are farther than oneself."
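Putting it together, a sketch of the layered query, reusing `search_layer`; here `layers[i]` is assumed to be the adjacency dict of layer i (layer 0 at the bottom) and `top_entry` the global entry point:

```python
def hnsw_query(layers, vectors, query, top_entry, k, dist):
    entry = [top_entry]
    # Upper layers: K = 1, i.e. just carry a single entry point downwards.
    for layer in reversed(layers[1:]):
        entry = search_layer(layer, vectors, query, entry, 1, dist)
    # Bottom layer: ask for the k candidates we actually want.
    return search_layer(layers[0], vectors, query, entry, k, dist)
```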
First, generate the "height" of the new vector based on a geometric distribution.
Then comes the "search" step, descending from the highest level to this height, using best-first search at each level to output K=1 candidate vectors, obtaining an entry point for the new vector's height.
For each level from the new vector's height down to the bottom, the new vector must actually be inserted, so we need to choose M neighbors for it. The method is to descend layer by layer with some search width K > M, and at each level select M of the K output candidates (the nearest ones, or ones chosen by other heuristic criteria) to connect to the new vector; the K candidates also serve as entry points for the level below.
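A sketch of insertion along these lines, reusing `random_height` and `search_layer` from the earlier sketches; `ef` plays the role of the K > M just mentioned (efConstruction in the paper), and the neighbor selection here is plain "take the M nearest" rather than the paper's heuristic:

```python
def hnsw_insert(layers, vectors, top_entry, new_id, new_vec, M, ef, dist, p=0.5):
    """Insert new_vec; returns the (possibly updated) global entry point id."""
    vectors[new_id] = new_vec
    prev_top = len(layers)
    height = random_height(p)                  # geometric height, exactly as for skip lists
    while len(layers) < height:                # grow the hierarchy if the new node is taller
        layers.append({})
    entry = [top_entry] if top_entry is not None else []
    # Phase 1: descend with K = 1 through the layers above the new node's height.
    for level in range(prev_top - 1, height - 1, -1):
        entry = search_layer(layers[level], vectors, new_vec, entry, 1, dist)
    # Phase 2: from the new node's top layer down to layer 0, search with width K = ef,
    # connect to the M nearest candidates, and pass all candidates down as entry points.
    for level in range(height - 1, -1, -1):
        layer = layers[level]
        eps = [e for e in entry if e in layer] or ([top_entry] if top_entry in layer else [])
        candidates = search_layer(layer, vectors, new_vec, eps, ef, dist) if eps else []
        neighbors = candidates[:M]             # plain "M nearest"; the paper also has a heuristic
        layer[new_id] = set(neighbors)
        for u in neighbors:
            layer[u].add(new_id)               # connections are bidirectional
        entry = candidates
    # The new node becomes the global entry point if it is taller than everything so far.
    return new_id if (top_entry is None or height > prev_top) else top_entry
```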
Details and pseudocode of the above query and insertion algorithms can be found in the paper. Finally, here is a Python code repository containing a proof of concept implementation of Hierarchical Navigable Small World Networks (HNSW); for a high-performance implementation of HNSW, you can refer to the FAISS library, which is based on C++.