Canonicalising the tree reduction

4.4. Canonicalising the tree reduction

The reduction of graph matching to ternary trees from the previous section is a big step towards an algorithm for graph matching. However, Proposition 4.5 is expressed in terms of existence of PSGs – it is as yet unclear how the trees can be constructed. This is the purpose of this section.

We introduce for this purpose a canonical, that is, invariant under isomorphism, choice of PSG GπG^\pi of GG. The result is a unique canonical transformation GGπc(Gπ)G \mapsto G^\pi \mapsto c(G^\pi) from GG to a cPSG that we can use for pattern matching.

We proceed by using the total order that we have defined on port labels and can be extended lexicographically to paths outgoing from a shared root operation (see section 4.2 for more details). Whenever more than one path from rr to oo exist in GG, it suffices to consider the smallest one. For a choice of root operation rO,r \in O, we thus obtain a total order of all operations OO in GG.

We then restrict our attention to operations on two linear paths and consider them in order. We keep track of linear paths that have been visited and proceed as follows to determine whether oOo \in O must be split:

  • if oo is on a linear path that was not seen before, it is left unchanged and the set of visited linear paths is updated;
  • otherwise, i.e. oo is on two linear paths that have already been visited, the operation is split, resulting in two operations on a single linear path.

The pseudocode CanonicalPathSplit implements this algorithm. We use Operations(G) to retrieve all the operations on the graph G and LinearPaths(G, op) to retrieve the linear paths of the operation op. The linear paths are identified using integer indices that can be pre-computed and stored in linear time in the graph size. SplitOperation(G, op) returns the graph resulting from splitting op into two operations on a single linear path. Finally, PathAsPortLabels(G, root, v) returns the string of the port labels that encode the path from root to v in the graph G. The strings are ordered lexicographically. The non-capitalized functions set, union, sort1, len, and issubset have their standard meanings.

 1def CanonicalPathSplit(G: Graph, root: Operation) -> Graph:
 2  new_G := G
 3  all_operations := Operations(G)
 4  sorted_operations := sort(
 5      all_operations,
 6      sort_key= lambda v: PathAsPortLabels(G, root, v)
 7  )
 8
 9  # keep track of the visited linear paths
10  seen_paths := set()
11  for op in sorted_operations:
12    # Get the (pre-computed) indices of the linear paths
13    op_linear_paths := LinearPaths(G, op)
14    if len(op_linear_paths) == 2:
15      if issubset(op_linear_paths, seen_paths):
16        # The two linear paths of `op` are already visited
17        new_G = SplitOperation(new_G, op)
18      else:
19        # Mark the new linear paths as visited
20        seen_paths = union(seen_paths, op_linear_paths)
21  return new_G

The following figure shows an example of splitting a graph into its canonical PSG using CanonicalPathSplit.

Splitting a graph into its canonical PSG. Ports are ordered counter-clockwise on each edge, and numbered according to the lexicographic order of the paths from root to the port, as returned by PathAsPortLabels. This induces an order on the hyperedges, reflected in the alphabetic order of the edge labels. Linear paths are formed by ports in a horizontal line (as marked by the dotted lines). Vertex root is chosen as the root of the canoncal splitting. Vertices d and g are not split because they are the smallest edges that contain the fourth, respectively first linear path.

Splitting a graph into its canonical PSG. Ports are ordered counter-clockwise on each edge, and numbered according to the lexicographic order of the paths from root to the port, as returned by PathAsPortLabels. This induces an order on the hyperedges, reflected in the alphabetic order of the edge labels. Linear paths are formed by ports in a horizontal line (as marked by the dotted lines). Vertex root is chosen as the root of the canoncal splitting. Vertices d and g are not split because they are the smallest edges that contain the fourth, respectively first linear path.

Proposition 4.7Correctness of CanonicalPathSplit
For a graph GG, the graph returned by CanonicalPathSplit(G) is a valid PSG of G.G. It is deterministic and invariant under isomorphism of GG. The runtime of CanonicalPathSplit is O(G)O(|G|), where G|G| is the number of operations in the graph GG.
Proof

Let GπG^\pi be the graph returned by CanonicalPathSplit(G). From the discussion in the proof of Proposition 4.6, we know it is sufficient to show that the interaction graph I\mathcal{I} of GπG^\pi is acyclic and connected.

I\mathcal{I} is acyclic. If there was a cycle in I\mathcal{I}, then there would be operations o0,,ok1o_0, \dots, o_{k-1} in GG that pairwise (oi,oi+1modk)(o_i, o_{i+1\, mod\, k}) share a linear path. One of these operations must be considered last in the for loop of lines 11–20, suppose it is ok1o_{k-1}. But every linear path of ok1o_{k-1} is either also a linear path of ok2o_{k-2} or a linear path of o1o_{1}: ok1o_{k-1} thus does not satisfy the condition on line 15, and thus cannot be in I\mathcal{I}, a contradiction. Hence I\mathcal{I} is acyclic.

I\mathcal{I} is connected. We proceed inductively to show the following invariant for the main for loop (lines 11–20): for all linear paths in seen_paths, there is a path in I\mathcal{I} to a linear path of the root operation. seen_paths is only modified on line 20. If op is the root operation, then trivially there is a path from the linear paths op_linear_paths to a linear path of the root operation. Otherwise, we claim that there must be one of the paths in op_linear_paths that is already in seen_paths. From there it follows that there is a path in I\mathcal{I} from the root path to the unseen linear path, given by the path to the linear path in seen_path followed by the edge in I\mathcal{I} that corresponds to op.

By connectedness of GG, there is a path from the root operation to op. The path is not empty because op is not the root operation, so we can consider the prefix of the path of all operations excluding op. Call op' the last operation preceding op and op_linear_paths' its linear paths. Two successive operations on a path must share a linear path: op_linear_paths \cap op_linear_paths' cannot be empty. According to line 4, op' must have been visited before op, thus op_linear_paths' \subseteq seen_paths. It follows that at least one element of op_linear_paths must be in seen_paths.

Determinstic and isomorphism invariant. The pseudocode above is deterministic and only depends on paths in GG encoded as strings of port labels, which are invariant under isomorphism.

Runtime complexity. Lines 2 and 3 run in O(G)O(|G|) time. With the exception of the sort function on lines 4–7, every other line can be run in O(1)O(1) time:

  • lines 13 and 15 run in constant time because the size of op_linear_paths is always at most 2;
  • line 20 (and the in check on line 15) can be run in constant time by representing the seen_paths set as a fixed-size boolean array of size ww, with the ii-th bit indicating whether the ii-th linear path has been seen;
  • line 17 is a constant time transformation if we allow in-place modification of new_G.

The for loop will run G|G| iterations, for a total of O(G)O(|G|) runtime. Finally, the sorting operation would naively take time O(GlogG)O(|G| \log |G|). However, given that the ordering is obtained lexicographically from the paths starting at the root, we can obtain the sorted list of operations by depth-first traversal of the graph starting at the root. The result follows.

Using CanonicalPathSplit, we can now sketch what the pattern matching algorithm should look like. For each pattern, we first compute their canonical PSG for an arbirary choice of pattern root operation; then, given a graph GG, we can find all embeddings of patterns into GG by iterating over all possible PSGs within GG. Naively, this involves enumerating all posible subgraphs of GG, and then for each of them, iterating over all possible root choices.

This can be significantly sped up by realising that many of the PSGs that are computed when iterating over all possible subgraphs and root choices are redundant2. We will see in the next section that we can i) iterate once over all possible root choices rr in GG and ii) introduce a new procedure AllPathSplits that will efficiently enumerate all possible rooted ual trees of PSGs that are rooted in rr for subgraphs within GG. In the process, we will also see that we can replace the tree equality check of line 12 with a subtree inclusion check, further reducing the number of PSGs that must be considered.

Naive pattern matching.

 1# Precompute all PSGs
 2allT = [CanonicalPathSplit(
 3    P, root_P
 4) for (P, root_P) in patterns]
 5
 6for S in Subgraphs(G):
 7  for root_S in Operations(S):
 8    TG = CanonicalPathSplit(
 9        S, root_S
10    )
11    for T in allT:
12      if T == TG:
13        yield T

Improved using AllPathSplits (section 4.5).

 1# Precompute all PSGs
 2allT = [CanonicalPathSplit(
 3    P, root_P
 4) for (P, root_P) in patterns]
 5
 6for root_G in Operations(G):
 7  for TG in AllPathSplits(
 8      G, root_G
 9  ):
10    for T in allT
11      # Replace == with subtree
12      if IsSubTree(T, TG)
13        yield T

  1. The sort_key parameter of the sort function defines the total order according to which the elements are sorted, from smallest to largest. ↩︎

  2. Think for example of the same root operation rr that is considered repeatedly for every overlapping subgraph of GG that contains rr↩︎