2024-10-12
Orthogonal range search in computational geometry aims to quickly answer queries for the objects lying inside an axis-aligned (rectangular/cuboidal) range.
In one dimension, an ordered vector already solves the problem, but that approach does not scale to higher dimensions. Instead, preprocess the objects into a balanced binary tree whose leaf nodes are the objects themselves and whose internal nodes each store the maximum object in their left subtree. If the objects are already sorted, preprocessing takes O(n); otherwise it takes O(n log n).
To query the objects within a range [a, b], let k be their number (the symbol k keeps this meaning in all later query-time analyses). Search for a and b in the tree, reaching leaf nodes u and v, and let z be the node where the root-to-u and root-to-v paths diverge. To list the objects in range: along the path from z down to u, report every right subtree hanging off the path; along the path from z down to v, report every left subtree hanging off the path.
If only the number k of objects in range is needed, the query is O(log n) (store subtree sizes at internal nodes); to list every object, the query must traverse each reported subtree, which is O(k + log n). This data structure is called a range tree.
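To make the procedure concrete, here is a minimal Python sketch (illustrative, assuming the objects are distinct numbers given in sorted order; the reporting version is shown, and the counting version would store subtree sizes instead):

class Node:
    def __init__(self, lmax=None, left=None, right=None, value=None):
        self.lmax, self.left, self.right, self.value = lmax, left, right, value

def build(objs):
    # objs is already sorted, so preprocessing is O(n)
    if len(objs) == 1:
        return Node(value=objs[0])
    mid = len(objs) // 2
    return Node(lmax=objs[mid-1], left=build(objs[:mid]), right=build(objs[mid:]))

def report(node, out):
    # dump an entire subtree; total cost over one query is O(k)
    if node.value is not None:
        out.append(node.value)
    else:
        report(node.left, out)
        report(node.right, out)

def query(root, a, b):
    out = []
    z = root
    # descend to the split node z, where the searches for a and b diverge
    while z.value is None and (b <= z.lmax or a > z.lmax):
        z = z.left if b <= z.lmax else z.right
    if z.value is not None:  # the paths never diverged: a single candidate leaf
        return [z.value] if a <= z.value <= b else []
    u = z.left  # follow a downward; report right subtrees hanging off the path
    while u.value is None:
        if a <= u.lmax:
            report(u.right, out)
            u = u.left
        else:
            u = u.right
    if a <= u.value <= b:
        out.append(u.value)
    v = z.right  # follow b downward; report left subtrees hanging off the path
    while v.value is None:
        if b > v.lmax:
            report(v.left, out)
            v = v.right
        else:
            v = v.left
    if a <= v.value <= b:
        out.append(v.value)
    return out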
We need to search for points within the $[x_1, x_2] \times [y_1, y_2]$ rectangular range. Extending the one-dimensional approach, first ignore the y-coordinate and build a range tree for the x-coordinate (x-range tree). Then, at each internal node, build a range tree for the y-coordinate (y-range tree) for the points contained in its subtree. This results in a 2D range tree.
The search method is to first search the x-range tree to obtain O(log n) subtrees; then search the y-range tree for each subtree. Therefore, the total time complexity is $O(k + (\log n)^2)$.
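A minimal sketch of this structure (Python, counting version; each node's y-range tree is simplified to a sorted list searched by bisection, which gives the same per-node O(log n) cost; distinct coordinates assumed):

import bisect

class Node2D:
    def __init__(self, pts):
        self.pts = sorted(pts)               # points of this subtree, sorted by x
        self.ys = sorted(p[1] for p in pts)  # stand-in for this node's y-range tree
        self.left = self.right = None
        if len(pts) > 1:
            mid = len(self.pts) // 2
            self.left = Node2D(self.pts[:mid])
            self.right = Node2D(self.pts[mid:])

def count2d(node, x1, x2, y1, y2):
    lo_x, hi_x = node.pts[0][0], node.pts[-1][0]
    if hi_x < x1 or lo_x > x2:
        return 0                             # subtree disjoint from the x-range
    if x1 <= lo_x and hi_x <= x2:            # subtree inside the x-range: answer the y-part here
        return bisect.bisect_right(node.ys, y2) - bisect.bisect_left(node.ys, y1)
    return (count2d(node.left, x1, x2, y1, y2) +
            count2d(node.right, x1, x2, y1, y2))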
Considering the space complexity of the 2D range tree: a node at depth h of the x-range tree carries a y-range tree of size $m = O(n/2^h)$. Summing the sizes of all y-range trees by depth gives $\Sigma_{h=0}^{\log n} 2^h (n/2^h) = 1 \cdot n + 2 \cdot (n/2) + 4 \cdot (n/4) + \dots = n \log n$, so the overall space complexity is O(n log n).
Considering the time complexity of constructing the 2D range tree: the naive method builds the x-range tree in O(n) time (the points having been sorted), then independently builds the y-range tree of each node, taking $\Sigma_{h=0}^{\log n} n \cdot O(\log(n/2^h)) = n \cdot \Sigma_{h=0}^{\log n} O(\log n - h) = O(n (\log n)^2)$. However, if the y-range trees are constructed bottom-up, each one can be obtained from those of the left and right children by an O(m) merge instead of an O(m log m) sort, reducing the total to O(n log n); the merge is sketched below.
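The merge is the standard two-pointer pass, here over the children's sorted y-lists:

def merge(ys_left, ys_right):
    # combine two sorted lists in O(m), instead of re-sorting in O(m log m)
    out, i, j = [], 0, 0
    while i < len(ys_left) and j < len(ys_right):
        if ys_left[i] <= ys_right[j]:
            out.append(ys_left[i])
            i += 1
        else:
            out.append(ys_right[j])
            j += 1
    out.extend(ys_left[i:])
    out.extend(ys_right[j:])
    return out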
Following the above approach to construct a d-dimensional range tree, it is not difficult to see that its query time is $O(k + (\log n)^d)$. Mathematical induction shows that the construction time is $O(n (\log n)^{d-1})$: when a dimension is added, building the (d-1)-dimensional trees at all nodes costs $\Sigma_{h=0}^{\log n} 2^h \cdot O\left((n/2^h)(\log(n/2^h))^{d-2}\right) = n \cdot \Sigma_{h=0}^{\log n} O((\log n - h)^{d-2}) = O(n (\log n)^{d-1})$.
Another way to derive the preprocessing complexity is the Master Theorem [Note 1]. Let the complexity of building a d-dimensional range tree for n objects be $T_d(n)$. Since the range tree is constructed bottom-up, consider the last step: after constructing two range trees of half the size, construct the lower-dimensional range tree corresponding to the tree root.
$$ T_2(n) = 2T_2(n/2) + O(n) \implies T_2(n) = O(n \log n) $$ $$ T_3(n) = 2T_3(n/2) + T_2(n) = 2T_3(n/2) + O(n \log n) \implies T_3(n) = O(n (\log n)^2) $$
And so on,
$$ T_d(n) = 2T_d(n/2) + T_{d-1}(n) = 2T_d(n/2) + O(n (\log n)^{d-2}) \implies T_d(n) = O(n (\log n)^{d-1}) $$
Fractional cascading links related data structures together so that a search position in one structure can be located quickly in another; we will use it to speed up range trees.
The simplest example of fractional cascading (FC): given two ordered vectors A and B, where every element of B also appears in A, find which elements of A and B lie within a range [x1, x2]. By adding links from elements of A to elements of B, one binary search on A suffices: the matching elements of B are then listed through the links, avoiding a second binary search on B.
A: 11 13 14 15 17 19 23 37 42 45
   |     |  |        |  |
B: 11    14 15       23 37
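A minimal Python sketch of these links (here bridge[i] is the index of the first element of B that is >= A[i]; in a real FC build the bridges come from one linear merge, bisect is used only for brevity):

import bisect

def build_bridges(A, B):
    # bridge[i]: index of the first element of B that is >= A[i]
    return [bisect.bisect_left(B, a) for a in A]

def fc_range(A, B, bridge, x1, x2):
    lo = bisect.bisect_left(A, x1)   # the only binary search
    hi = bisect.bisect_right(A, x2)
    in_A = A[lo:hi]
    in_B = []
    j = bridge[lo] if lo < len(A) else len(B)  # jump into B through the bridge
    while j < len(B) and B[j] <= x2:
        in_B.append(B[j])
        j += 1
    return in_A, in_B

With A and B as in the diagram, fc_range(A, B, build_bridges(A, B), 14, 40) returns ([14, 15, 17, 19, 23, 37], [14, 15, 23, 37]).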
Applied to range trees: in the last dimension, replace the 1D range tree with an ordered vector. In the penultimate dimension, link each node's ordered vector with the ordered vectors of its left and right children (possible because each child's ordered vector is contained in the parent's). When querying the last two dimensions [x1, x2] x [y1, y2], start from the top of the tree, choose the descending direction by [x1, x2], and maintain the index range corresponding to the value range [y1, y2] within the current ordered vector. With this, the search over these two dimensions takes O(k + log n).
Applying this process, it is not difficult to see that the search in d>1 dimensional range trees is optimized to $ O(k + (\log n)^{d-1}) $.
Range trees have high space usage and large time-space overhead for preprocessing. The k-d tree sacrifices query complexity but has a simpler structure and performs well in practice.
Alternately bisect the point set by x and y coordinates, splitting at the median coordinate each time, until each part holds only O(1) points. The resulting tree is guaranteed to be balanced, i.e., of height O(log n). Each internal node stores the coordinates of its bounding box; call this bounding box a cell.
The construction method is as follows:
First, sort the point set by x and by y to obtain two ordered vectors, and add cross-links between them (each point's entry in one vector points to its entry in the other), taking O(n log n) time.
Divide the x-ordered vector into two halves at the median x-coordinate; then, following the links from the y-ordered vector into the x-ordered vector, scan the y-ordered vector once to split it into the same two halves while keeping each half sorted, taking O(n) time; a sketch of this split appears below. The problem is now split into two subproblems of size n/2. Recursively split (alternating between x and y coordinates): letting T(n) be the time for a subproblem of size n, T(n) = 2T(n/2) + O(n), so by the master theorem T(n) = O(n log n). (Since the vectors here are ordered, the median is read off by index access; see [Note 2] for selecting a median in general.)
Therefore, the time complexity of building a 2D k-d tree is O(n log n); the space complexity is O(n).
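A sketch of the O(n) split step referenced above (Python, points as (x, y) tuples; a hash set stands in for the cross-links between the two ordered vectors):

def split_points(xs, ys):
    # xs, ys: the same point set, sorted by x and by y respectively
    mid = len(xs) // 2      # the median is read off by index, since xs is sorted
    left = set(xs[:mid])    # plays the role of the links between the two vectors
    ys_left = [p for p in ys if p in left]        # one O(n) scan; both halves stay y-sorted
    ys_right = [p for p in ys if p not in left]
    return xs[:mid], ys_left, xs[mid:], ys_right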
The method of searching a k-d tree is straightforward: if a cell is disjoint from the range, return; if the cell is contained within the range, report all objects in it; if the cell partially intersects the range, recursively search the left and right children.
To derive the complexity of this process, let Q(n) denote the number of sub-cells of a cell containing n points that an axis-parallel line can intersect. After descending two levels there are 4 sub-cells, and an axis-parallel line can intersect at most two of them, so Q(n) = 2Q(n/4) + O(1); by the master theorem, $Q(n) = O(n^{1/2})$. A cell visited by the search either lies fully inside the query rectangle (its points are charged to the output size k) or is crossed by one of the rectangle's four edges, so at most $4 Q(n) = O(n^{1/2})$ cells are visited without reporting. The search process is therefore $O(k + n^{1/2})$.
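A minimal 2D k-d tree sketch (Python; for brevity the construction re-sorts at every level, costing O(n (log n)^2) instead of using the linked-vector split above; the search implements the three cases just described):

def build_kd(pts, axis=0):
    # pts: list of (x, y) tuples; the split axis alternates at each level
    if len(pts) <= 1:
        return ("leaf", pts)
    pts = sorted(pts, key=lambda p: p[axis])
    mid = len(pts) // 2
    cell = (min(p[0] for p in pts), max(p[0] for p in pts),   # bounding box
            min(p[1] for p in pts), max(p[1] for p in pts))
    return ("node", cell, build_kd(pts[:mid], 1 - axis), build_kd(pts[mid:], 1 - axis))

def report_all(node, out):
    if node[0] == "leaf":
        out.extend(node[1])
    else:
        report_all(node[2], out)
        report_all(node[3], out)

def search_kd(node, x1, x2, y1, y2, out):
    if node[0] == "leaf":
        out.extend(p for p in node[1] if x1 <= p[0] <= x2 and y1 <= p[1] <= y2)
        return
    _, (lx, hx, ly, hy), left, right = node
    if hx < x1 or lx > x2 or hy < y1 or ly > y2:
        return                           # cell disjoint from the range
    if x1 <= lx and hx <= x2 and y1 <= ly and hy <= y2:
        report_all(node, out)            # cell contained in the range
        return
    search_kd(left, x1, x2, y1, y2, out)   # cell partially intersects: recurse
    search_kd(right, x1, x2, y1, y2, out)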
For d dimensions, the derivation parallels the two-dimensional case.
The first step of the construction process requires $O(nd\log n)$. The second step of the construction process, alternately bisecting, $T(n) = 2T(n/2) + O(nd)$, by the master theorem, $T(n) = O(nd \log n)$. Therefore, the time complexity of the construction process is $O(nd\log n)$.
The input data occupies O(nd) space; the index from the first construction step takes O(nd) space; and since each node stores a bounding box (O(d) numbers, over O(n) nodes), the whole tree also takes O(nd) space.
During the search, $Q(n) = 2^{d-1}Q(n/2^d) + O(1)$, by the master theorem, $Q(n) = O(n^{1-1/d})$, so the complexity is $O(k + n^{1-1/d})$.
| Data Structure | Construction Time Complexity | Space Complexity | Query Time Complexity |
|---|---|---|---|
| range tree with FC | $O(n (\log n)^{d-1})$ | $O(n (\log n)^{d-1})$ | $ O(k + (\log n)^{d-1}) $ |
| kd tree | $O(nd\log n)$ | $O(nd)$ | $O(k + n^{1-1/d})$ |
Below are two data structures designed for searching line segments: Interval Tree and Segment Tree. Intuitively, the process of constructing an interval tree is like stabbing skewers into pieces of food; the process of constructing a segment tree is like letting a horizontal strip fall vertically onto the tree, breaking into several segments hanging at different heights on the tree.
Problem: Given n one-dimensional intervals, how can we list all intervals containing a given point in O(k + log n) time? (This problem is known as a stabbing query.)
Approach: Build an interval tree. List all interval endpoints and take their median xm as the root node, storing there all intervals that contain xm. These intervals are stored twice, once sorted by left endpoint and once by right endpoint. Each remaining interval lies entirely to the left or entirely to the right of xm; interval trees are recursively built for them as the left and right children of the root.
The space complexity of the interval tree is O(n). Without proof, it is asserted that building the interval tree takes O(n log n) time, and the tree height is O(log n).
Search method: Let the query point coordinate be xq. If xq = xm, directly output all intervals stored in the root node; otherwise, assume xq < xm, then: 1) Traverse the intervals stored in the root node sorted by left endpoint, listing all intervals until the left endpoint > xq; 2) Recursively query the interval tree of the left child. The case for xq > xm is similar.
It is not difficult to see that the query time is O(k + log n).
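A compact Python sketch of the construction and query just described (intervals as (l, r) pairs, nonempty input; for brevity each recursive call re-sorts, so this sketch does not meet the O(n log n) construction bound asserted above):

class IntervalTree:
    def __init__(self, intervals):
        endpoints = sorted(x for iv in intervals for x in iv)
        self.xm = endpoints[len(endpoints) // 2]              # median endpoint
        here = [iv for iv in intervals if iv[0] <= self.xm <= iv[1]]
        self.by_left = sorted(here)                           # ascending by left endpoint
        self.by_right = sorted(here, key=lambda iv: -iv[1])   # descending by right endpoint
        lefts = [iv for iv in intervals if iv[1] < self.xm]
        rights = [iv for iv in intervals if iv[0] > self.xm]
        self.left = IntervalTree(lefts) if lefts else None
        self.right = IntervalTree(rights) if rights else None

    def stab(self, xq, out):
        if xq <= self.xm:
            for iv in self.by_left:      # stop at the first left endpoint > xq
                if iv[0] > xq:
                    break
                out.append(iv)
            if xq < self.xm and self.left:
                self.left.stab(xq, out)
        else:
            for iv in self.by_right:     # stop at the first right endpoint < xq
                if iv[1] < xq:
                    break
                out.append(iv)
            if self.right:
                self.right.stab(xq, out)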
Problem: Given n non-intersecting line segments in the plane, prepare to handle the following query: given a new vertical line segment, return all segments that intersect it.
Approach: Consider the x-projections of the input segments, using all endpoint x-coordinates to divide the x-axis into a series of elementary intervals. Build a balanced binary tree (called a segment tree) with these intervals as leaf nodes. If the x-projection of a segment s covers the interval of node v but not the interval of v's parent, store s in the node list of v. Since the segments do not intersect, each such list can be sorted "top to bottom" (pick any x-coordinate in v's interval and compare the segments' y-coordinates there).
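The storage rule is the canonical decomposition of s's x-projection; a sketch over elementary-interval indices (Python; SegNode and insert are illustrative names):

class SegNode:
    def __init__(self, l, r):
        self.l, self.r = l, r    # range of elementary-interval indices covered
        self.segs = []           # segments stored here, to be sorted top-to-bottom
        self.left = self.right = None
        if l < r:
            m = (l + r) // 2
            self.left = SegNode(l, m)
            self.right = SegNode(m + 1, r)

def insert(node, lo, hi, seg):
    # store seg at every maximal node whose interval lies inside [lo, hi]; such a
    # node is covered by seg while its parent is not, and there are O(log n) of them
    if hi < node.l or lo > node.r:
        return
    if lo <= node.l and node.r <= hi:
        node.segs.append(seg)
        return
    insert(node.left, lo, hi, seg)
    insert(node.right, lo, hi, seg)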
Query method: Use the x-coordinate of the vertical segment to descend the balanced binary tree. At each node entered, binary search the list stored there to find and report the segments whose y-coordinate at that x falls within the query's y-range.
Complexity: The height of the balanced binary tree is O(log n); each segment is stored in O(log n) nodes; preprocessing time and space are O(n log n); query time is $O(k + (\log n)^2)$.
The "segment tree" in computational geometry allows querying which intervals contain a given point in O(log n) time. Its variant, the segment tree in programming competitions/tests, allows querying and modifying an interval in O(log n) time.
Problem: Given a sequence of length n, perform the following two operations: 1) Add k to each number in a certain (index) interval, 2) Find the sum of all numbers in a certain interval. Is there a way to complete each operation in O(log n)? The problem is from Luogu P3372 [Template] Segment Tree 1.
Approach: Build a tree shaped like the computational-geometry segment tree above, where each node maintains two variables: sum and lazy. If the interval to be updated covers node v's interval but not that of v's parent, record the update on v (in the lazy variable). Since the mark is not immediately passed to v's left and right children, it is called a lazy mark. Node v's sum tracks the total of the numbers in its interval, up to pending ancestor marks, as made precise below.
Code: Assume we build the tree in a heap-like manner, with the left and right children of node p being 2*p and 2*p+1, respectively. Let the lazy mark array be lazy and the interval sum array be sum. Due to the existence of the lazy mark, there is a gap between sum and the real interval sum, differing by the cumulative lazy mark of the ancestors multiplied by the interval length.
Invariants include:
A global relation: the true sum of the numbers in node p's interval = sum[p] + (sum of the lazy marks of p's proper ancestors) * p's interval length
A local relation: sum[p] = sum[2*p] + sum[2*p+1] + lazy[p] * interval length of p
Start the recursive access from the root. If the query interval and the node interval do not intersect, return directly. If the query interval contains the node interval, add a lazy mark and return. If they partially intersect, add (overlap length) * amount to sum and recurse into the left and right children.
// assumed node layout (heap-style array; node i covers the index interval [l, r], inclusive):
typedef long long ll;
typedef struct { int l, r; ll sum, lazy; } segment;

// record "add `amount` to every element of node i's interval" at node i itself
void awardlazy(segment* tree, int i, ll amount) {
    tree[i].lazy += amount;
    tree[i].sum += amount * (tree[i].r - tree[i].l + 1);
}
void rangeadd(segment* tree, int i, int lo, int hi, int amount) {
    if (lo > hi || hi < tree[i].l || lo > tree[i].r) return; // no intersection
    if (lo > tree[i].l && hi < tree[i].r)   // query strictly inside node: overlap is the whole query
        tree[i].sum += (ll)(hi - lo + 1) * amount;
    else if (lo > tree[i].l)                // query sticks out to the right: overlap is [lo, r]
        tree[i].sum += (ll)(tree[i].r - lo + 1) * amount;
    else if (hi < tree[i].r)                // query sticks out to the left: overlap is [l, hi]
        tree[i].sum += (ll)(hi - tree[i].l + 1) * amount;
    else { // target interval contains node i: mark lazily and stop descending
        awardlazy(tree, i, amount);
        return;
    }
    rangeadd(tree, 2*i, lo, hi, amount);
    rangeadd(tree, 2*i+1, lo, hi, amount);
}
Start the recursive access from the root. If the query interval and the node interval do not intersect, return 0 directly. If the query interval contains the node interval, return sum[p] directly (we ensure that whenever this happens, all lazy marks on the path from the root to p have already been pushed down). If they partially intersect, "push down" lazy[p] to the left and right children (both the lazy and the sum of each child need updating) and recurse into both children.
ll lookup(segment* tree, int i, int lo, int hi) {
if (lo > hi || hi < tree[i].l || lo > tree[i].r) return 0;
if (lo <= tree[i].l && hi >= tree[i].r) {
return tree[i].sum;
}
// need to recurse, propagate lazy mark
awardlazy(tree, 2*i, tree[i].lazy);
awardlazy(tree, 2*i+1, tree[i].lazy);
tree[i].lazy = 0;
return lookup(tree, 2*i, lo, hi) + lookup(tree, 2*i+1, lo, hi);
}
Now, support a third operation: multiply each number in a certain interval by k. The problem is from Luogu P3373 [Template] Segment Tree 2.
To do this, we add a second lazy mark for multiplication. There are now two lazy marks: add indicates "how much to add," and mult indicates "how much to multiply"; the code below calls them lazy_add and lazy_mult.
Assume the path from p back to the root passes through p~1, p~2, ..., (p~k represents going up k levels from p). The invariant becomes:
Global: The sum of numbers in the interval of node p = (sum[p] * mul[p~1] + len(p) * add[p~1]) * mul[p~2] + len(p) * add[p~2] ...
Local: sum[p] = (sum[2*p] + sum[2*p+1]) * mul[p] + len(p) * add[p]
Note that the order of operations is "multiply first, then add."
// assumed node layout for this variant: typedef struct { int l, r; ll sum, lazy_add, lazy_mult; } segment;
// (lazy_mult starts at 1). If the node has children, existing lazy marks must be propagated before new ones arrive.
void try_push_lazy(segment* tree, int i) {
if (tree[i].l != tree[i].r) {
tree[2*i].sum = (tree[2*i].sum * tree[i].lazy_mult + tree[i].lazy_add * (tree[2*i].r - tree[2*i].l + 1)) % MOD;
tree[2*i+1].sum = (tree[2*i+1].sum * tree[i].lazy_mult + tree[i].lazy_add * (tree[2*i+1].r - tree[2*i+1].l + 1)) % MOD;
tree[2*i].lazy_mult = (tree[2*i].lazy_mult * tree[i].lazy_mult) % MOD;
tree[2*i+1].lazy_mult = (tree[2*i+1].lazy_mult * tree[i].lazy_mult) % MOD;
tree[2*i].lazy_add = (tree[2*i].lazy_add * tree[i].lazy_mult + tree[i].lazy_add) % MOD;
tree[2*i+1].lazy_add = (tree[2*i+1].lazy_add * tree[i].lazy_mult + tree[i].lazy_add) % MOD;
// reset
tree[i].lazy_add = 0;
tree[i].lazy_mult = 1;
}
}
Note that a node may have both add and mult lazy marks, which need to be propagated according to the "multiply first, then add" rule to ensure invariants hold.
// this helper is the multiplicative counterpart of awardlazy (referenced below but previously
// undefined); under "multiply first, then add", applying "* m" turns the pending op
// x -> x*mul + add into x -> x*(mul*m) + (add*m)
void awardlazymult(segment* tree, int i, ll m) {
    tree[i].sum = tree[i].sum * m % MOD;
    tree[i].lazy_mult = tree[i].lazy_mult * m % MOD;
    tree[i].lazy_add = tree[i].lazy_add * m % MOD;
}

ll rangemult(segment* tree, int i, int lo, int hi, int mult) {
    // invariant: the path from the root to i carries no lazy add or mult
    if (lo > hi || hi < tree[i].l || lo > tree[i].r) return tree[i].sum;
    try_push_lazy(tree, i);
    if (lo <= tree[i].l && hi >= tree[i].r) {
        awardlazymult(tree, i, mult);
        return tree[i].sum;
    }
    tree[i].sum = (rangemult(tree, 2*i, lo, hi, mult) + rangemult(tree, 2*i+1, lo, hi, mult)) % MOD;
    return tree[i].sum;
}
This recursive call ensures that there are no lazy marks on the path from the current node to the root of the tree. To maintain this promise, lazy marks must be propagated before the recursive call.
The node's sum after the range update is recomputed by adding up the sums returned by its left and right children.
[Note 1] The Master Theorem. Assume the running time of an algorithm with input size n satisfies $T(n) = a\,T(n/b) + f(n)$.
Case 1: if $f(n) = O(n^c)$ with $c < \log_b a$,
then $T(n) = \Theta(n^{\log_b a})$.
Case 2: if $f(n) = \Theta(n^{\log_b a} (\log n)^k)$ with $k \geq 0$,
then $T(n) = \Theta(n^{\log_b a} (\log n)^{k+1})$.
Case 3: if $f(n) = \Omega(n^c)$ with $c > \log_b a$,
and there exists a constant $r < 1$ such that $a f(n/b) \leq r f(n)$ for all sufficiently large n, then $T(n) = \Theta(f(n))$.
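For instance, two recurrences from earlier in this article fall under these cases:
$$ T(n) = 2T(n/2) + O(n): \quad a = b = 2,\ f(n) = \Theta(n^{\log_2 2} (\log n)^0) \implies \text{Case 2 with } k = 0,\ T(n) = \Theta(n \log n) $$
$$ Q(n) = 2Q(n/4) + O(1): \quad a = 2,\ b = 4,\ f(n) = O(n^0),\ 0 < \log_4 2 = 1/2 \implies \text{Case 1},\ Q(n) = \Theta(n^{1/2}) $$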
[Note 2] What is the time complexity of selecting the k-th smallest element from an unordered list of n elements? The naive method is to sort, O(n log n), while the lower bound is $\Omega(n)$, because no element can be asserted to be the k-th smallest before all elements have been examined.
Quickselect is an algorithm with O(n) average time. Like quicksort, it selects a pivot and swaps array elements so that the elements left of the pivot do not exceed it and the elements right of it are not less than it:
# input: the pivot's index; output: the pivot's new index after partitioning
def partition(A, lo, hi, pivot): # operates on A[lo..hi], inclusive
pivotvalue = A[pivot]
A[hi], A[pivot] = A[pivot], A[hi] # swap pivot to right end
while lo < hi:
# fill gap at A[hi]
while lo < hi and A[lo] <= pivotvalue:
lo += 1
A[hi] = A[lo]
# fill gap at A[lo]
while lo < hi and A[hi] >= pivotvalue:
hi -= 1
A[lo] = A[hi]
A[lo] = pivotvalue
return lo
To find the k-th smallest element, we want some pivot to land exactly at index k-1. Repeatedly select a tentative pivot (the selection strategy matters a great deal), partition, and examine the pivot's new position: if it is exactly k-1, the problem is solved; if it is less than k-1, only the right part of the array need be considered; if it is greater than k-1, only the left part.
def kth_small(A, k):
    k = k-1  # convert to the 0-based target index: the k-th smallest belongs at index k-1
lo, hi = 0, len(A)-1
while lo < hi:
pos = partition(A, lo, hi, lo)
if pos <= k:
lo = pos+1
if pos >= k:
hi = pos-1
return A[k]
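A quick randomized check of the two functions above (illustrative test, using distinct values):

import random

A = random.sample(range(1000), 99)
assert kth_small(A[:], 50) == sorted(A)[49]  # the 50th smallest element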
From the above algorithm it can be seen that, while finding the k-th smallest element, quickselect also leaves it at position A[k-1], with no element of A[0..k-1) exceeding A[k-1] and no element of A(k-1..n) less than A[k-1], dividing the unordered vector into two parts.
Let T(n) be the expected time to find the k-th smallest among n elements. After one partition (costing n-1 comparisons), if the pivot lands at index j the problem reduces to a subproblem of size at most max(j, n-j-1); averaging over the pivot's final position gives $T(n) \leq (n-1) + \frac{1}{n} \Sigma_{j=0}^{n-1} T(\max(j, n-j-1)) \leq (n-1) + \frac{2}{n} \Sigma_{j=n/2}^{n-1} T(j)$.
Assume $T(n) \leq c n$ and substitute into the above recurrence; simplifying gives $c \leq \frac{n-1}{n} + \frac{c}{4} \cdot \frac{n-2}{n} \cdot \frac{3n-2}{n} \leq 1 + \frac{3c}{4}$, which holds for c = 4. Therefore, the average time of quickselect is O(n).