2024-09-05
To reiterate, the pseudocode sections in the alg series of blog posts (including this one) are copied or adapted from the lecture notes of Tsinghua University's Data Structures course (by Deng Junhui). They can be accessed from this link.
Keywords:
Minimum Spanning Tree: Prim's Algorithm, Kruskal's Algorithm (the latter relies on the union-find data structure)
Shortest Path Tree / Single-Source Shortest Path (SSSP): Dijkstra's Algorithm
Shortest distances between all pairs of vertices: the Floyd-Warshall Algorithm
Graph storage methods: Adjacency Matrix (vertex-to-vertex), Adjacency List (vertex-to-edge)
The space complexity of the adjacency matrix is $O(n^2)$. Checking the existence/adding/removing of an edge from node i to j is O(1), checking the in-degree/out-degree of node i is O(1) (with an additional maintained array), and traversing all outgoing or incoming edges of node i is O(n).
The adjacency list, where each node stores a list of the edges leaving it (and/or entering it), has a space complexity of O(n+e). Checking the existence of / adding / removing an edge from node u to v is O(deg(u)) or O(deg(v)), checking the in-degree/out-degree of node i is O(1), and traversing the outgoing/incoming edges of node v takes O(deg(v)).
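To make the trade-offs concrete, here is a minimal Python sketch of both representations for a small directed graph (the variable names are mine, not from the notes):

n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # directed edges (u, v)

# adjacency matrix: O(n^2) space, O(1) edge existence check
adj_matrix = [[False] * n for _ in range(n)]
for u, v in edges:
    adj_matrix[u][v] = True
print(adj_matrix[0][2])   # True: O(1) lookup

# adjacency list: O(n+e) space, O(deg(u)) edge existence check
adj_list = [[] for _ in range(n)]
for u, v in edges:
    adj_list[u].append(v)
print(2 in adj_list[0])   # True: O(deg(0)) lookup
print(len(adj_list[2]))   # out-degree of node 2: O(1)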
Initialize node states:
for vertex v in graph:
    status[v] = "undiscovered"

Perform BFS from node s:
status[s] = "discovered"
queue q
q.enqueue(s)
while q is not empty:
    v = q.dequeue()
    for every neighbor u of v:
        if status[u] == "undiscovered":
            status[u] = "discovered"
            q.enqueue(u)
            # (v, u) is an edge in the BFS tree
If the graph is connected, one BFS is sufficient; otherwise, you can traverse the node list and start BFS from any undiscovered node. This way, BFS can construct a spanning tree for each connected component.
c = 0  # number of connected components
for vertex v in graph:
    if status[v] == "undiscovered":
        BFS(v)
        c = c + 1
Since each node is enqueued only once and each edge is checked only once, the algorithm is O(n+e).
Note: The forest generated by BFS contains c trees and n-c edges in total.
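A runnable Python version of the above pseudocode, as a minimal sketch; it assumes the graph is given as an adjacency list `adj` (a name of my choosing) and returns the component count together with the BFS forest:

from collections import deque

def bfs_components(adj):
    """Count connected components and build a BFS forest over an undirected graph."""
    n = len(adj)
    status = ["undiscovered"] * n
    parent = [-1] * n  # BFS forest: parent[u] is u's parent, -1 for roots
    c = 0              # number of connected components
    for s in range(n):
        if status[s] != "undiscovered":
            continue
        c += 1
        status[s] = "discovered"
        q = deque([s])
        while q:
            v = q.popleft()
            for u in adj[v]:
                if status[u] == "undiscovered":
                    status[u] = "discovered"
                    parent[u] = v  # (v, u) is an edge of the BFS tree
                    q.append(u)
    return c, parent

# two components: {0, 1, 2} and {3, 4}
print(bfs_components([[1, 2], [0], [0], [4], [3]]))  # (2, [-1, 0, 0, -1, 3])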
Perform DFS from node v:
def dfs(v):
    status[v] = "discovered"
    for every neighbor u of v:
        if status[u] == "undiscovered":
            dfs(u)  # (v, u) is an edge in the DFS tree
In my opinion, the call stack depth of the recursive DFS above depends on the depth of the DFS tree (up to the length of the longest chain in the input graph), which may cause a stack overflow on adversarial inputs. There are two ways to rewrite it as a non-recursive function:
The first approach is to simulate the call stack explicitly: the variables that recursion would keep on the call stack (the current node number and which neighbor is being visited) are stored in a stack of our own:
vector<int> path(1, root);     // stack of node numbers
vector<int> next_ids(1, 0);    // progress of visiting each node's neighbors
while !path.empty()
    cur = path.back(); next_id = next_ids.back();
    if next_id == 0                          // first time we see the current node
        visited[cur] = true
        visit cur
    if next_id == neighbors[cur].size()      // all neighbors of the current node processed
        path.pop_back(); next_ids.pop_back();
    else
        next_ids.back()++;
        next = neighbors[cur][next_id]
        if !visited[next]                    // skip visited nodes (e.g. the parent of the current node)
            path.push_back(next)
            next_ids.push_back(0)
The second approach is to note that DFS is a priority search in which later-discovered nodes have higher priority; a plain stack suffices to manage this priority.
stack s  // use a stack instead of a priority queue
s.push(root)
while s is not empty
    v = s.pop()
    if visited[v] continue
    visited[v] = true
    visit v
    for every neighbor u of v:
        if !visited[u]
            (v, u) is in the DFS tree
            s.push(u)
The above program actually has a problem: a DFS tree edge is recorded every time a node is pushed onto the stack, but a node may be pushed more than once (an earlier instance is discarded upon popping, because the node has already been visited through a later instance), so more than n-1 tree edges may be recorded, a contradiction. Likewise, a node may be given multiple "parents."
The fix is to additionally record each node's prospective parent when it is pushed onto the stack; when a popped node v is actually visited, record (parent, v) as an edge of the DFS tree, as in the sketch below.
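A minimal Python sketch of the corrected version (assuming an adjacency list `adj`): each stack entry carries its prospective parent, and the tree edge is recorded only when the node is actually visited.

def dfs_iterative(adj, root):
    """Iterative DFS that records each node's parent in the DFS tree."""
    n = len(adj)
    visited = [False] * n
    parent = [-1] * n
    stack = [(-1, root)]  # (prospective parent, node)
    while stack:
        p, v = stack.pop()
        if visited[v]:
            continue  # a stale instance of v; discard it
        visited[v] = True
        parent[v] = p  # (p, v) is an edge of the DFS tree (p == -1 for the root)
        for u in adj[v]:
            if not visited[u]:
                stack.append((v, u))
    return parent

print(dfs_iterative([[1, 2], [0, 2], [0, 1, 3], [2]], 0))  # [-1, 2, 0, 2]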
Interface: Given a directed graph G, attempt to arrange all its vertices into a linear sequence compatible with the original graph, i.e., for each directed edge (x, y), x appears before y in the sequence. If G is a DAG, such a sequence should be returned; otherwise, report that G is not a DAG.
Lemma: A non-empty DAG must have at least one vertex with zero in-degree.
Algorithm: Dynamically modify graph G and maintain a stack S storing all nodes with zero in-degree; repeatedly pop node v from the stack and remove v from G until G is empty (success) or S is empty (failure).
initialize stack S with all vertices of in-degree 0
seq = []
while S is not empty:
    v = S.pop()
    seq.append(v)
    for each neighbor u of v:
        if u.indegree == 1:  # u's in-degree drops to 0 once v is removed
            S.push(u)
    G = G \ {v}  # delete v (and its outgoing edges) from the graph
if G is empty:
    return seq
else:
    return NOT_A_DAG
Removing v from graph G, i.e., removing each outgoing edge of v, is O(deg_out(v)), so the entire algorithm is O(n+e).
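A runnable Python sketch of the algorithm above; instead of literally deleting vertices from G, it decrements in-degrees, which has the same effect (the function and variable names are mine):

def topological_sort(adj):
    """Kahn's algorithm: return a topological order, or None if the graph has a cycle."""
    n = len(adj)
    indegree = [0] * n
    for v in range(n):
        for u in adj[v]:
            indegree[u] += 1
    stack = [v for v in range(n) if indegree[v] == 0]
    seq = []
    while stack:
        v = stack.pop()
        seq.append(v)
        for u in adj[v]:
            indegree[u] -= 1       # "remove" edge (v, u)
            if indegree[u] == 0:
                stack.append(u)
    return seq if len(seq) == n else None  # leftover vertices lie on cycles

print(topological_sort([[1, 2], [3], [3], []]))  # [0, 2, 1, 3]
print(topological_sort([[1], [0]]))              # None (not a DAG)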
We want to traverse/search each node in the graph according to a user-defined priority, where the priority can be adjusted as the algorithm progresses. Specifically, traversing each node can only affect the priority of its adjacent nodes.
About the implementation of the priority queue:
If using a vector to implement a priority queue (the simplest and easiest method), updating the priority takes O(1) time, but traversing to find the highest priority node takes O(n) time, so the algorithm's complexity is O(n^2 + e).
If using a balanced binary search tree to implement a priority queue, updating the priority takes O(log n) time, so the algorithm's complexity is O((n+e) log n).
If using a heap to implement a priority queue, the complexity is the same, but an additional index must be maintained to record where each node is in the heap, so that the priority of the node with index i can be updated (priority increase results in sift-up, priority decrease results in sift-down, and all sift operations must simultaneously update the index) [see Appendix 1 for code].
Using a Fibonacci heap can optimize to O(e + n log n) (source, source).
Many problems or algorithms can be implemented through priority search, such as:
Prim's algorithm, Dijkstra's algorithm
A* search algorithm (the priority is the cost estimate f(n) = g(n)+h(n), prioritizing nodes with lower costs. Here, g(n) is the cost already incurred to node n, and h(n) is the estimated cost from node n to the endpoint). Dijkstra's algorithm can be seen as a special case of the A* algorithm when h(n) is zero.
BFS can be seen as a special case of Dijkstra's algorithm when all edge weights are 1; DFS can be seen as a priority search where the new node's priority equals the moment it is discovered (maintaining a global "timer").
The A* search algorithm, with g(n)+h(n) as the priority, is thus a general framework. When h(n) = 0, only the accumulated cost is considered and it becomes Dijkstra's algorithm; when g(n) = 0, the accumulated cost is ignored and it becomes best-first search: a node's evaluation is determined only by the node itself, independent of the edges and paths in the graph. This latter type of problem is referred to as the best-node problem in this article. The shared structure is sketched below.
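A sketch of this shared structure in Python, with a lazy-deletion heap; `update_priority` is a hypothetical callback of my own that encodes the concrete algorithm:

import heapq

def priority_search(adj, s, update_priority):
    """Generic priority search with a lazy-deletion heap.

    adj[v] is a list of (u, w) pairs; update_priority(prio_v, w) returns
    the priority that v offers to its neighbor u via an edge of weight w.
    """
    prio = {s: 0}
    parent = {s: None}
    done = set()
    heap = [(0, s)]
    while heap:
        p, v = heapq.heappop(heap)
        if v in done:
            continue  # stale heap entry
        done.add(v)
        for u, w in adj[v]:
            cand = update_priority(p, w)
            if u not in done and cand < prio.get(u, float("inf")):
                prio[u], parent[u] = cand, v
                heapq.heappush(heap, (cand, u))
    return prio, parent

adj = {0: [(1, 4), (2, 1)], 1: [(0, 4), (2, 2)], 2: [(0, 1), (1, 2)]}
print(priority_search(adj, 0, lambda p, w: p + w))  # Dijkstra: accumulated cost
print(priority_search(adj, 0, lambda p, w: w))      # Prim: crossing-edge weight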
We aim to obtain the minimum spanning tree (MST) of a connected undirected graph.
The representation of a minimum spanning tree is simple, for example, using an array A of length n, where A[i] stores the parent node number of the i-th node in the minimum spanning tree.
Lemma: For any cut of a connected graph, the shortest crossing edge must be contained in some MST. Conversely, every MST of a connected graph must contain the shortest crossing edge of each cut (assuming, for the converse, that this edge is unique).
Idea: Continuously expand a node set V_k and its corresponding spanning tree (the set can be initialized with any node), and at each step, add the shortest crossing edge in the cut (V_k; V \ V_k) and its corresponding new node.
Implementation: Use priority search, defining a node's priority as its distance to V_k (the weight of its shortest crossing edge). Each iteration only needs to traverse the current node's neighbors to keep priorities correct. If a crossing edge (i, j) causes j's priority to be updated, assign A[j] = i (painter's algorithm: later writes overwrite earlier ones). The complexity depends on the priority queue implementation, as analyzed earlier under priority search; a vector-based sketch follows.
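A minimal O(n^2) Python sketch of Prim's algorithm with a plain vector as the priority queue, matching the description above; it assumes an adjacency matrix `W` with `inf` marking absent edges:

def prim(W):
    """O(n^2) Prim: W is an adjacency matrix, W[i][j] = inf if there is no edge."""
    inf = float("inf")
    n = len(W)
    dist = [inf] * n   # priority: distance from V_k to each node
    A = [-1] * n       # A[j] = parent of j in the MST
    in_tree = [False] * n
    dist[0] = 0
    for _ in range(n):
        # vector-based priority queue: linear scan for the closest outside node
        v = min((i for i in range(n) if not in_tree[i]), key=lambda i: dist[i])
        in_tree[v] = True
        for j in range(n):
            if not in_tree[j] and W[v][j] < dist[j]:
                dist[j] = W[v][j]
                A[j] = v  # painter's algorithm: overwrite the old parent
    return A

inf = float("inf")
W = [[inf, 4, 1], [4, inf, 2], [1, 2, inf]]
print(prim(W))  # [-1, 2, 0]: MST edges (2, 1) and (0, 2)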
Idea: Attempt each edge in order of increasing weight, adding it if it is safe (i.e., the chosen edges still form a forest, with no cycle).
Implementation: Maintain a forest containing n single-node trees and 0 edges. Traverse all edges in order of increasing weight, and if their vertices come from different trees, merge the trees on both sides. The correctness is easy to explain: take the tree T on one side of the introduced edge e and V \ T as the cut of the graph, then e must be the shortest crossing edge.
How to quickly "merge" two trees (or two equivalence classes)? Use the union-find algorithm.
One approach is to maintain an array T of length n, always ensuring that node i belongs to tree T[i], initialized as T[i] = i. When edge (i, j) is considered, T[i] != T[j] indicates that its endpoints come from different trees, which can then be merged. The merge scans the array T and reassigns every element equal to T[i] to T[j], while also setting A[i] = j. Each merge is O(n), and there are at most n-1 merges, so this version of Kruskal's algorithm costs O(e log e + n^2) including the sort.
Another approach is to maintain an array parent of length n representing only the union-find relation (distinct from the parent-child relation in the spanning tree), initialized as parent[i] = i. To find the equivalence class of an element, find(i) ascends along parent until parent[x] == x, meaning i belongs to class x. To merge class i into class j, simply set parent[i] = j. To keep the trees shallow and searches fast, each merge attaches the shorter tree under the taller one, and the search performs "path compression", reattaching the nodes along the ascent path to the root it reaches. With both optimizations, each operation takes amortized near-constant time (inverse Ackermann function); see the sketch below.
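A Python sketch of Kruskal's algorithm on top of the second approach, with path compression and union by rank (rank approximating tree height); the names are mine:

def kruskal(n, edges):
    """edges is a list of (weight, i, j); returns the MST edge list."""
    parent = list(range(n))
    rank = [0] * n

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression (halving variant)
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri == rj:
            return False
        if rank[ri] < rank[rj]:        # attach the shorter tree to the taller
            ri, rj = rj, ri
        parent[rj] = ri
        if rank[ri] == rank[rj]:
            rank[ri] += 1
        return True

    mst = []
    for w, i, j in sorted(edges):      # try edges in order of increasing weight
        if union(i, j):                # i and j were in different trees
            mst.append((i, j, w))
    return mst

print(kruskal(4, [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3)]))
# [(0, 1, 1), (1, 2, 2), (2, 3, 4)]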
The single-source shortest path problem involves finding the shortest path tree from a given node s to all other vertices. The shortest path tree can be represented by an array A of length n, where A[i] stores the parent node number of the i-th node in the shortest path tree.
Using priority search, the priority of a node is defined as the currently known shortest path distance between it and s. In each loop, only the neighbors of the current node need to be traversed to correctly update the priority. If an edge (i, j) updates the priority of j, then assign A[j] = i (painter's algorithm). The complexity is analyzed in priority search.
Correctness: The correctness of Dijkstra's algorithm relies on the following prefix property: any prefix of a shortest path is itself a shortest path and is no longer than the full path. For graphs with non-negative edge weights the algorithm is correct (even in the presence of zero-weight edges, because nodes popped from the queue are never enqueued again); for graphs with negative edge weights it is incorrect, as the counterexample below shows.
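A heap-based Python sketch of Dijkstra's algorithm, followed by a tiny graph with one negative edge on which it returns a wrong distance (graph and names are my own illustration):

import heapq

def dijkstra(adj, s):
    """adj[v] is a list of (u, w); returns distances and shortest-path-tree parents."""
    dist = {s: 0}
    A = {s: None}
    done = set()
    heap = [(0, s)]
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue  # stale heap entry
        done.add(v)
        for u, w in adj[v]:
            if u not in done and d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                A[u] = v  # painter's algorithm
                heapq.heappush(heap, (dist[u], u))
    return dist, A

# non-negative weights: correct
print(dijkstra({0: [(1, 1), (2, 4)], 1: [(2, 2)], 2: []}, 0)[0])  # {0: 0, 1: 1, 2: 3}

# a negative edge breaks the prefix property: the true distance to 2 is 4 + (-3) = 1,
# but node 2 is finalized with distance 2 before node 1 is processed
print(dijkstra({0: [(1, 4), (2, 2)], 1: [(2, -3)], 2: []}, 0)[0])  # {0: 0, 1: 4, 2: 2}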
Computes the shortest path lengths between all pairs of vertices. The output alone has size $O(n^2)$. Running Dijkstra's algorithm n times, once from each vertex, costs $O(n^3)$ (with a vector-based priority queue); the Floyd-Warshall algorithm introduced below is also $O(n^3)$, but it is simpler in form and allows negative-weight directed edges (negative-weight cycles are still not allowed).
Number the vertices 1 to n, and define $d^k(u, v)$ as the length of the shortest path from u to v whose intermediate vertices are all among the first k vertices. It satisfies:
$$
d^k(u, v) =
\begin{cases}
w(u, v) & \text{if } k = 0 \\
\min\big( d^{k-1}(u, v),\ d^{k-1}(u, k) + d^{k-1}(k, v) \big) & \text{if } k > 0
\end{cases}
$$
Note that $d^{k-1}(u, k) = d^k(u, k)$ and $d^{k-1}(k, v) = d^k(k, v)$, so the formula can be computed with three nested loops, k outermost, reusing a single array A[1..n][1..n] (a rolling-array dynamic program: in iteration k, the k-th row and k-th column remain unchanged):
for k in 1..n:
    for u in 1..n:
        for v in 1..n:
            A[u][v] = min(A[u][v], A[u][k] + A[k][v])
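The same triple loop as runnable Python, for completeness (a sketch; `W` is the adjacency matrix with `inf` for missing edges and 0 on the diagonal):

def floyd_warshall(W):
    n = len(W)
    A = [row[:] for row in W]  # copy; A is updated in place over n rounds
    for k in range(n):
        for u in range(n):
            for v in range(n):
                A[u][v] = min(A[u][v], A[u][k] + A[k][v])
    return A

inf = float("inf")
W = [[0, 3, inf], [inf, 0, 1], [2, inf, 0]]
print(floyd_warshall(W))  # [[0, 3, 4], [3, 0, 1], [2, 5, 0]]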
As mentioned earlier, we want to find the "best" (minimum h(n)) node, regardless of edge weights and paths along the way. A possible real-world scenario is retrieving the vector closest to the target vector in a database of high-dimensional dense vectors, assuming the vectors in the database are organized into a graph by approximately "connecting several close neighbors."
The Best-first search algorithm attempts to solve this problem given an entry point:
use enter point to initialize priority queue
while priority queue is not empty:
    pop node v with top priority
    mark v as visited
    for each neighbor u of v who is not visited:
        mark u as visited
        push u to priority queue
This will obviously traverse all nodes in the database. For retrieval efficiency, some stopping conditions can be set, such as letting u enter the queue only if u is closer than v (greedy), etc.
We may also want to retrieve the K vectors closest to the target vector, in which case we can use a heap to record K candidate nodes:
use enter point to initialize priority queue
let W be a candidate pool of max size K, organized as a heap with the worst on top
while priority queue is not empty:
    pop node v with top priority
    mark v as visited
    if v is worse than every candidate in W:
        break
    for each neighbor u of v who is not visited:
        mark u as visited
        if u is better than the worst candidate in W:
            push u to W
            push u to priority queue
We tentatively call this algorithm "best-first beam search", although the name may be a misnomer, because it differs from other uses of "beam search". In beam search for autoregressive decoders in natural language processing, for example, the number of candidates that may continue to be extended is capped at K at every step (i.e., at every position of the decoded sequence). The algorithm above, however, does not limit the size of the priority queue at all, only the size of the candidate pool W. A larger candidate pool means the break condition is met later.
The above algorithm is from Algorithm 2 in Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.
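A Python sketch of the pseudocode above (not a faithful reproduction of the paper's Algorithm 2): the candidate pool W is a max-heap implemented with negated distances, so the worst candidate sits on top; `adj`, `dist`, and the toy graph are my own illustration.

import heapq

def beam_search(adj, dist, enter, target, K):
    """Best-first beam search sketch; dist(v, target) is the priority (lower is better)."""
    d0 = dist(enter, target)
    candidates = [(d0, enter)]   # min-heap: closest node on top
    W = [(-d0, enter)]           # max-heap via negation: worst candidate on top
    visited = {enter}
    while candidates:
        d, v = heapq.heappop(candidates)
        if d > -W[0][0]:         # v is worse than every candidate in W
            break
        for u in adj[v]:
            if u in visited:
                continue
            visited.add(u)
            du = dist(u, target)
            if du < -W[0][0] or len(W) < K:  # better than the worst, or W not full
                heapq.heappush(W, (-du, u))
                if len(W) > K:
                    heapq.heappop(W)         # evict the worst candidate
                heapq.heappush(candidates, (du, u))
    return sorted((-d, u) for d, u in W)

# toy 1-D example: nodes are numbers on a line, distances are absolute gaps
adj = {0: [2, 4], 2: [0, 4, 5], 4: [0, 2, 5, 9], 5: [2, 4, 9], 9: [4, 5]}
print(beam_search(adj, lambda a, b: abs(a - b), 0, 7, K=2))  # [(2, 5), (2, 9)]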
Class
class PQMod:
    """
    A priority queue that lets you:
    1) Insert event with priority
    2) Remove an existing event
    3) Pull or peek the highest priority event
    4) Change the priority of an existing event
    Each operation is O(log n) time.
    """

    def __init__(self):
        # The heap invariant is heap[k] <= heap[2*k+1] and heap[k] <= heap[2*k+2]
        self.heap = list()   # Each element is an (event, priority) tuple
        self.idxof = dict()  # Maps event to its index in the heap

    def _swap(self, i, j):
        # Swap two elements in the heap, updating the event-to-index mapping
        self.idxof[self.heap[i][0]], self.idxof[self.heap[j][0]] = j, i
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]

    def _sift_up_at(self, idx: int):
        while idx > 0 and self.heap[idx][1] < self.heap[(idx-1)//2][1]:
            self._swap(idx, (idx-1)//2)
            idx = (idx-1)//2

    def _children_status(self, idx: int):
        if 2*idx+2 <= len(self.heap)-1 and self.heap[idx][1] > self.heap[2*idx+2][1] \
                and self.heap[2*idx+1][1] > self.heap[2*idx+2][1]:
            return 2  # right child goes up
        elif 2*idx+1 <= len(self.heap)-1 and self.heap[idx][1] > self.heap[2*idx+1][1]:
            return 1  # left child goes up
        else:
            return 0  # neither child goes up

    def _sift_down_at(self, idx: int):
        c = self._children_status(idx)
        while c > 0:
            self._swap(idx, 2*idx+c)
            idx = 2*idx+c
            c = self._children_status(idx)

    def _adjust_at(self, idx: int):
        self._sift_up_at(idx)    # does nothing if it should not sift up
        self._sift_down_at(idx)  # does nothing if it should not sift down

    def insert(self, e, p):
        if e in self.idxof:
            raise ValueError("Duplicate Events are not allowed")
        self.heap.append((e, p))
        self.idxof[e] = len(self.heap) - 1
        self._sift_up_at(len(self.heap)-1)

    def remove(self, e):
        idx = self.idxof[e]
        self._swap(idx, len(self.heap)-1)
        del self.heap[len(self.heap)-1]
        del self.idxof[e]
        if idx == len(self.heap):  # we removed the last element
            return
        self._adjust_at(idx)

    def pull_highest(self, peek=False):
        if len(self.heap) == 0:
            raise IndexError("Cannot Pull from Empty Queue")
        e, p = self.heap[0]
        if not peek:
            self.remove(e)
        return e, p

    def change_priority(self, e, new_p):
        idx = self.idxof[e]
        self.heap[idx] = (e, new_p)
        self._adjust_at(idx)
Usage Example
def pq_mod_example():
    pq = PQMod()
    events = [("code", 1), ("sleep", 0), ("eat", 2), ("run", 5), ("drink", 3)]
    for e, p in events:
        pq.insert(e, p)
    print(pq.heap)
    print(pq.pull_highest())
    print(pq.heap)
    pq.change_priority("drink", -1)
    print(pq.heap)
Script for Pair Testing
class PQList:
    """
    Reference implementation of PQMod using a list and linear scans, O(n) time
    """

    def __init__(self):
        self.events = list()  # (event, priority) tuples

    def insert(self, e, p):
        self.events.append((e, p))

    def remove(self, event):
        for e, p in self.events:
            if e == event:
                self.events.remove((e, p))
                return  # stop here: mutating the list while iterating is unsafe

    def pull_highest(self):
        e_hi, p_hi = self.events[0]
        for e, p in self.events:
            if p < p_hi:
                e_hi, p_hi = e, p
        self.remove(e_hi)
        return e_hi, p_hi

    def change_priority(self, e, new_p):
        for i in range(len(self.events)):
            if self.events[i][0] == e:
                self.events[i] = (e, new_p)

def p_generator(n):
    # Generate distinct priorities:
    # with duplicate priorities, the results of PQMod and PQList may differ
    import random
    ps = list(range(n))
    random.shuffle(ps)
    for p in ps:
        yield p
def pq_mod_test(n):
    import random
    pq = PQMod()
    pq_ref = PQList()
    cnt = (2*n)//3
    p_gen = p_generator(100+5*n)  # big enough so that it does not run out
    priorities = [next(p_gen) for _ in range(cnt)]
    events = [str(i) for i in range(cnt)]
    print("events: " + str(events))
    print("priorities: " + str(priorities))
    for i in range(cnt):
        pq.insert(events[i], priorities[i])
        pq_ref.insert(events[i], priorities[i])
        # print(pq.heap)
    while len(pq.heap) < n:
        r = random.random()
        if r < 0.3:
            p = next(p_gen)
            print("inserting event "+str(cnt)+" with priority "+str(p))
            pq.insert(str(cnt), p)
            pq_ref.insert(str(cnt), p)
            events.append(str(cnt))
            cnt += 1
        elif r < 0.8:
            e = random.choice(events)
            new_p = next(p_gen)
            print("changing event "+e+" 's priority to "+str(new_p))
            pq.change_priority(str(e), new_p)
            pq_ref.change_priority(str(e), new_p)
        elif r < 0.9:
            if len(pq.heap) > 0:
                e, p = pq.pull_highest()
                e_ref, p_ref = pq_ref.pull_highest()
                print("pulled highest priority event " +
                      e+" with priority "+str(p))
                try:
                    assert e == e_ref
                    assert p == p_ref
                except AssertionError:
                    print("ERROR: e, e_ref, p, p_ref is")
                    print(e, e_ref)
                    print(p, p_ref)
                    exit()
                events.remove(e)
        else:
            if len(pq.heap) > 0:
                e = random.choice(events)
                print("cancelling event "+e)
                events.remove(e)
                pq.remove(e)
                pq_ref.remove(e)
        # print(pq.heap)
        # print(pq_ref.events)
    print("cleaning...")
    while len(pq.heap) > 0:
        e, p = pq.pull_highest()
        e_ref, p_ref = pq_ref.pull_highest()
        print("pulled highest priority event "+e+" with priority "+str(p))
        try:
            assert e == e_ref
            assert p == p_ref
        except AssertionError:
            print("ERROR: e, e_ref, p, p_ref is")
            print(e, e_ref)
            print(p, p_ref)
            exit()
        # print(pq.heap)
        # print(pq_ref.events)
    print("TEST PASSED")

pq_mod_test(300)