4.4. Canonicalising the tree reduction
The reduction of graph matching to ternary trees from the previous section is a big step towards an algorithm for graph matching. However, Proposition 4.5 is expressed in terms of existence of PSGs – it is as yet unclear how the trees can be constructed. This is the purpose of this section.
We introduce for this purpose a canonical, that is, invariant under isomorphism, choice of PSG of . The result is a unique canonical transformation from to a cPSG that we can use for pattern matching.
We proceed by using the total order that we have defined on port labels and can be extended lexicographically to paths outgoing from a shared root operation (see section 4.2 for more details). Whenever more than one path from to exist in , it suffices to consider the smallest one. For a choice of root operation we thus obtain a total order of all operations in .
We then restrict our attention to operations on two linear paths and consider them in order. We keep track of linear paths that have been visited and proceed as follows to determine whether must be split:
- if is on a linear path that was not seen before, it is left unchanged and the set of visited linear paths is updated;
 - otherwise, i.e. is on two linear paths that have already been visited, the operation is split, resulting in two operations on a single linear path.
 
The pseudocode CanonicalPathSplit implements this algorithm. We use
Operations(G) to retrieve all the operations on the graph G and
LinearPaths(G, op) to retrieve the linear paths of the operation op. The
linear paths are identified using integer indices that can be pre-computed and
stored in linear time in the graph size. SplitOperation(G, op) returns the
graph resulting from splitting op into two operations on a single linear path.
Finally, PathAsPortLabels(G, root, v) returns the string of the port labels
that encode the path from root to v in the graph G. The strings are
ordered lexicographically. The non-capitalized functions set, union,
sort1, len, and issubset have their standard meanings.
 1def CanonicalPathSplit(G: Graph, root: Operation) -> Graph:
 2  new_G := G
 3  all_operations := Operations(G)
 4  sorted_operations := sort(
 5      all_operations,
 6      sort_key= lambda v: PathAsPortLabels(G, root, v)
 7  )
 8
 9  # keep track of the visited linear paths
10  seen_paths := set()
11  for op in sorted_operations:
12    # Get the (pre-computed) indices of the linear paths
13    op_linear_paths := LinearPaths(G, op)
14    if len(op_linear_paths) == 2:
15      if issubset(op_linear_paths, seen_paths):
16        # The two linear paths of `op` are already visited
17        new_G = SplitOperation(new_G, op)
18      else:
19        # Mark the new linear paths as visited
20        seen_paths = union(seen_paths, op_linear_paths)
21  return new_G
The following figure shows an example of splitting a graph into its canonical
PSG using CanonicalPathSplit.
Splitting a graph into its canonical PSG. Ports are ordered counter-clockwise on each edge, and numbered according to the lexicographic order of the paths from root to the port, as returned by PathAsPortLabels. This induces an order on the hyperedges, reflected in the alphabetic order of the edge labels. Linear paths are formed by ports in a horizontal line (as marked by the dotted lines). Vertex root is chosen as the root of the canoncal splitting. Vertices d and g are not split because they are the smallest edges that contain the fourth, respectively first linear path.
CanonicalPathSplitCanonicalPathSplit(G) is a valid PSG of
 It is deterministic and invariant under isomorphism of . The runtime of
CanonicalPathSplit is , where  is the number of operations in the
graph .Proof
Let  be the graph returned by CanonicalPathSplit(G). From the
discussion in the proof of Proposition 4.6, we know
it is sufficient to show that the interaction graph  of  is
acyclic and connected.
 is acyclic. If there was a cycle in , then there
would be operations  in  that pairwise
 share a linear path. One of these operations must be
considered last in the for loop of lines 11–20, suppose it is . But
every linear path of  is either also a linear path of  or a
linear path of :  thus does not satisfy the condition on line
15, and thus cannot be in , a contradiction. Hence  is
acyclic.
 is connected. We proceed inductively to show the following
invariant for the main for loop (lines 11–20): for all linear paths in
seen_paths, there is a path in  to a linear path of the root
operation. seen_paths is only modified on line 20. If op is the root
operation, then trivially there is a path from the linear paths
op_linear_paths to a linear path of the root operation. Otherwise, we claim
that there must be one of the paths in op_linear_paths that is already in
seen_paths. From there it follows that there is a path in  from
the root path to the unseen linear path, given by the path to the linear path in
seen_path followed by the edge in  that corresponds to op.
By connectedness of , there is a path from the root operation to op. The
path is not empty because op is not the root operation, so we can consider the
prefix of the path of all operations excluding op. Call op' the last
operation preceding op and op_linear_paths' its linear paths. Two successive
operations on a path must share a linear path: op_linear_paths 
op_linear_paths' cannot be empty. According to line 4, op' must have been
visited before op, thus op_linear_paths'  seen_paths. It
follows that at least one element of op_linear_paths must be in seen_paths.
Determinstic and isomorphism invariant. The pseudocode above is deterministic and only depends on paths in encoded as strings of port labels, which are invariant under isomorphism.
Runtime complexity. Lines 2 and 3 run in  time. With the exception of
the sort function on lines 4–7, every other line can be run in  time:
- lines 13 and 15 run in constant time because the size of 
op_linear_pathsis always at most 2; - line 20 (and the 
incheck on line 15) can be run in constant time by representing theseen_pathsset as a fixed-size boolean array of size , with the -th bit indicating whether the -th linear path has been seen; - line 17 is a constant time transformation if we allow in-place modification of
new_G. 
The for loop will run  iterations, for a total of  runtime.
Finally, the sorting operation would naively take time .
However, given that the ordering is obtained lexicographically from the paths
starting at the root, we can obtain the sorted list of operations by depth-first
traversal of the graph starting at the root. The result follows.
Using CanonicalPathSplit, we can now sketch what the pattern matching
algorithm should look like. For each pattern, we first compute their canonical
PSG for an arbirary choice of pattern root operation; then, given a graph ,
we can find all embeddings of patterns into  by iterating over all possible
PSGs within . Naively, this involves enumerating all posible subgraphs of
, and then for each of them, iterating over all possible root choices.
This can be significantly sped up by realising that many of the PSGs that are
computed when iterating over all possible subgraphs and root choices are
redundant2. We will see in the next section that we can i)
iterate once over all possible root choices  in  and ii) introduce a new
procedure AllPathSplits that will efficiently enumerate all possible rooted
ual trees of PSGs that are rooted in  for subgraphs within . In the
process, we will also see that we can replace the tree equality check of line 12
with a subtree inclusion check, further reducing the number of PSGs that must be
considered.
Naive pattern matching.
 1# Precompute all PSGs
 2allT = [CanonicalPathSplit(
 3    P, root_P
 4) for (P, root_P) in patterns]
 5
 6for S in Subgraphs(G):
 7  for root_S in Operations(S):
 8    TG = CanonicalPathSplit(
 9        S, root_S
10    )
11    for T in allT:
12      if T == TG:
13        yield T
Improved using AllPathSplits (section 4.5).
 1# Precompute all PSGs
 2allT = [CanonicalPathSplit(
 3    P, root_P
 4) for (P, root_P) in patterns]
 5
 6for root_G in Operations(G):
 7  for TG in AllPathSplits(
 8      G, root_G
 9  ):
10    for T in allT
11      # Replace == with subtree
12      if IsSubTree(T, TG)
13        yield T