Enumerating all path-split graphs

4.5. Enumerating all path-split graphs

The CanonicalPathSplit procedure in the previous section defines for all graphs GG and choice of root operation rr a canonical PSG GπG^\pi, and thus a canonical set of anchors π\pi that we write as πr(G)=π.\pi_r(G) = \pi.

Instead of CanonicalPathSplit, we can equivalently consider a CanonicalAnchors procedure, which computes πr(G)\pi_r(G) directly instead of the graph Gπr(G)G^{\pi_r(G)}.

We formulate this computation below, using recursion instead of a for loop. This form generalises better to the AllAnchors procedure that we will introduce next.

The equivalence of the CanonicalAnchors procedure with CanonicalPathSplit follows from the observation made in section 4.2 that ordering operations in lexicographic order of the port labels is equivalent to a depth-first traversal of the graph.

CanonicalAnchors implements a recursive depth-first traversal (DFS), with the twist that the recursion is explicit only on the anchor nodes and otherwise relying on the lexicographic ordering just like in CanonicalPathSplit: lines 5–15 of CanonicalAnchors correspond to the iterations of the for loop (line 11–20) of CanonicalPathSplit until an anchor operation is found (i.e. the else branch on lines 18–20 is executed). From there, the graph traversal proceeds recursively.

We introduce the ConnectedComponent, Neighbours and RemoveOperation procedures; the first returns the connected component of the current operation, whereas the other two procedures are used to traverse, respectively modify, the graph GG. Importantly, Neighbours(G, op) returns the neighbours of op ordered by port label order.

To ensure that the recursive DFS does not visit the same operation twice, we modify the graph with RemoveOperation on lines 11 and 15, ensuring that no visited operation remains in G. As a consequence, CanonicalAnchors may be called on disconnected graphs, which explains why an additional call to ConnectedComponent (line 4) is required.

Proposition 4.8Equivalence of CanonicalPathSplit and CanonicalAnchors

Let GG be a connected graph and let rr be a root operation in GG. Then CanonicalAnchors maps the graph to the canonical anchor set:

(G,r,{})(π(G)r,L,),(G, r, \{\}) \mapsto (\pi(G)_r, L, \varnothing),

where LL is the set of all paths in GG and \varnothing designates the empty graph.

The proof follows directly from the previous paragraphs.

 1def CanonicalAnchors(
 2    G: Graph, root: Operation, seen_paths: Set[int]
 3) -> (Set[Operation], Set[int], Graph):
 4  operations = Operations(ConnectedComponent(G, root))
 5  # sort by PathAsPortLabels, as previously
 6  sorted_operations := sort(operations)
 7  operations_queue := queue(sorted_operations)
 8
 9  # Skip all operations that are not anchors
10  op := operations_queue.pop() # never emtpy, contains root
11  G = RemoveOperation(G, op)
12  while len(LinearPaths(G, op)) == 1 or
13        issubset(LinearPaths(G, op), seen_paths):
14    op = operations_queue.pop() or return ({}, {}, G)
15    G = RemoveOperation(G, op)
16
17  # op is anchor, update seen_paths and recurse
18  seen_paths = union(seen_paths, LinearPaths(G, op))
19  anchors := [op]
20  # sort by port labels
21  for child in Neighbours(G, op):
22    (new_anchors, seen_paths, G) = CanonicalAnchors(
23        G, child, seen_paths
24    )
25    anchors += new_anchors
26
27  return (anchors, seen_paths, G)

Maximal PSGs #

In addition to “simplifying” the data required to define path splitting, the definition of PSGs using anchor operations has another advantage that is fundamental to the pattern matching algorithm.

Consider the rooted dual tree τr(Gπ)\tau_r(G^\pi) of a PSG with root operation rr in GπG^\pi. Recall that tree nodes are uniquely identified by their path from the root and thus are considered equal if they are isomorphic as graphs. We can in the same way define a tree inclusion relation \subseteq on rooted dual trees that corresponds to checking that the trees have the same root and that the left-hand side is isomorphic to a subtree of the right-hand side. We also require that the operation weights given by the splitsplit map splitsplit map coincide on the common subtree.

Proposition 4.9Maximal PSG

Let GG be a connected graph, π\pi a set of operations in GG and rπr \in \pi a root operation. Consider the set Gπ={HGπr(H)=π}.\mathcal{G}_\pi = \{H \subseteq G \mid \pi_r(H) = \pi \}.

There is a subgraph MGM \subseteq G such that for all subgraphs HGXH \in \mathcal{G}_X: HMH \subseteq M. Furthermore, for all graph PP, there is rr' and π=π(P)r\pi' = \pi(P)_{r'} such that

PHGXτr(Pπ)τr(Mπ).P \simeq H \in \mathcal{G}_X \quad\Leftrightarrow\quad \tau_{r'}(P^{\pi'}) \subseteq \tau_r(M^\pi).

We call MπM^\pi the maximal PSG with anchors π\pi in GG.

The proof gives an explicit construction for MM.

Proof

Assume GX\mathcal{G}_X \neq \varnothing, otherwise the statement is trivial.

Construction of MM. Let LL be the set of linear paths in GG that go through at least one operation in π\pi. Consider the set of operations OLO_L in GG given by the operations whose linear paths are contained in LL. This defines a subgraph GOLG|_{O_L} of GG. Since GX\mathcal{G}_X \neq \varnothing, there exists HGXH \in \mathcal{G}_X. By assumption, HH is connected, and thus the anchors π\pi of HH are connected in HH. There is therefore a connected component MGOLM \subseteq G|_{O_L} that contains the set π\pi.

Well-definedness of MπM^\pi. Consider the PSG MπM^\pi of MM. We must show that MπM^\pi is a tree-like graph for the proposition statement to be well-defined. In other words, we must show that the interaction graph I\mathcal{I} of MπM^\pi is acyclic and connected. MM is connected by construction, which implies connectedness of MπM^\pi and thus of I\mathcal{I}. It is acyclic because width(M)=π+1width(M) = |\pi| + 1 and MM has exactly π|\pi| operations on more than one linear path. I\mathcal{I} is a thus a tree.

HMH \subseteq M. For any subgraph HGXH \in \mathcal{G}_X, its operations must be contained in OLO_L. Since any HGH \in \mathcal{G} is connected and contains π\pi, it must further hold that HMH \subseteq M.

We can now prove the \Leftrightarrow equivalence of (1).

\Leftarrow: If τr(Pπ)τr(Mπ)\tau_{r'}(P^{\pi'}) \subseteq \tau_r(M^\pi), then there exists HMπH' \subseteq M^\pi with rooted dual tree

τr(Hπ)=τr(Pπ).\tau_r(H^\pi) = \tau_r(P^{\pi'}).

Furthermore, by definition of \subseteq on rooted trees, a splitsplit map is defined on HH', given by the splitsplit map of MπM^\pi on the domain HH'. Recall from section 4.3 that there is a map σ\sigma that maps

(H,split)σHand(Mπ,split)σM.(H', split) \overset{\sigma}{\longmapsto} H\quad\textrm{and}\quad(M^\pi, split) \overset{\sigma}{\longmapsto} M.
It merges split operations pairwise, and thus it is immediate that HMπH' \subseteq M^\pi implies HMH \subseteq M. Thus HGXH \in \mathcal{G}_X and H=HπH' = H^\pi. By construction, one can also derive that PHP \simeq H. The statement follows.

\Rightarrow: Since HGXH \in \mathcal{G}_X, we know from point 1 that HMH \subseteq M. Thus we can define an injective embedding φ:PM\varphi: P \to M.

Operation splitting leaves the set of values from HH to HπH^\pi, as well as from MM to MπM^\pi unchanged. Similarly, there is a bijection between values in HπH^\pi and MπM^\pi and thus between edges in τr(Hπ)\tau_r(H^\pi) and τr(Mπ)\tau_r(M^\pi). The pattern embedding φ\varphi hence defines an injective map ϕE\phi_E from tree edges in τr(Hπ)\tau_r(H^\pi) to tree edges in τr(Mπ)\tau_r(M^\pi). We extend this map to a map on the trees ϕ:τr(Hπ)τr(Mπ)\phi: \tau_r(H^\pi) \to \tau_r(M^\pi) by induction over the nodes set of τr(Hπ)\tau_r(H^\pi). We start by the root map ϕ(r)=r\phi(r) = r. Using ϕE\phi_E, we can then uniquely define the image of any child node of rr in τr(Hπ)\tau_r(H^\pi), and so forth inductively.

We show now that the map ϕ\phi thus defined is injective. Suppose v,vv, v' are nodes in τr(Hπ)\tau_r(H^\pi) such that ϕ(v)=ϕ(v)\phi(v) = \phi(v'). By the inductive construction there are paths from the root rr to vv and vv' respectively such that their image under ϕE\phi_E are two paths from rr to ϕ(v)=ϕ(v)\phi(v) = \phi(v'). But τr(Mπ)\tau_r(M^\pi) is a tree, so both paths must be equal. By bijectivity of ϕE\phi_E, it follows v=vv = v', and thus ϕ\phi is injective. Finally, the value and operation weights are invariant under pattern embedding and thus are preserved by definition.

This result means that instead of listing all PSGs for every possible subgraph of GG, it is sufficient to proceed as follows:

  1. for every pattern PP, fix a root operation rPr_P and construct the rooted tree dual of the canonical PSG
    τP:=τrP(PπrP(P)).\tau_P := \tau_{r_P}(P^{\pi_{r_P}(P)}).
  2. enumerate every possible root operation rr in GG,
  3. enumerate every possible sets of anchors π\pi in GG with root rr,
  4. for each set π\pi, find the maximal PSG MM with anchors π\pi in GG, and take its rooted tree dual τM:=τr(Mπ)\tau_M := \tau_r(M^\pi),
  5. find all patterns PP such that τPτM\tau_P \subseteq \tau_M.

In other words, if AllAnchors is a procedure that enumerates all possible sets of anchors π\pi in GG and MaximalPathSplit computes the maximal PSG MM as presented in the proof of Proposition 4.9, then AllPathSplits(G) can simply be obtained by calling AllAnchors and then returning their respective maximal PSGs in GG:

def AllPathSplits(G: Graph, root: Operation) -> Set[Graph]:
  all_anchors = AllAnchors(G, root)
  return {MaximalPathSplit(G, pi) for pi in all_anchors}

The missing piece: AllAnchors #

We can now complete the definition of AllPathSplits by defining the AllAnchors procedure, which enumerates all possible sets of anchors in GG given a root operation rr.

The procedure is similar to CanonicalAnchors, described in detail in the previous paragraphs. In addition to the arguments of CanonicalAnchors, AllAnchors requires a width w1w \geq 1 argument. It then returns all sets of at most ww operations1 that form the canonical anchors of some width-ww subgraph of GG with root rr. The main difference between CanonicalAnchors and AllAnchors is that the successive recursive calls (line 22 in CanonicalAnchors) are replaced by a series of nested loops (lines 42–48 in AllAnchors) that exhaustively iterate over the possible outcomes for different subgraphs of GG. The results of every possible combination of recursive calls are then collected into a list of anchor sets, which is returned.

The part of the pseudocode that is without comments is unchanged from CanonicalAnchors. Using Proposition 4.3, we know that we can assume that every operation has at most 3 children, and thus 3 neighbours in G, given that the operations equivalent to parent nodes were removed.

 1def AllAnchors(
 2    G: Graph, root: Operation, w: int,
 3    seen_paths: Set[int] = {}
 4) -> List[(Set[Operation], Set[int], Graph)]:
 5  # Base case: return one empty anchor list
 6  if w == 0:
 7    return [({}, {}, G)]
 8
 9  operations = Operations(ConnectedComponent(G, root))
10  sorted_operations := sort(operations)
11  operations_queue := queue(sorted_operations)
12
13  op := operations_queue.pop()
14  G = RemoveOperation(G, op)
15  while len(LinearPaths(G, op)) == 1 or
16        issubset(LinearPaths(G, op), seen_paths):
17    op = operations_queue.pop() or return [({}, {}, G)]
18    G = RemoveOperation(G, op)
19
20  seen0 = union(seen_paths, LinearPaths(G, op))
21  # There are always at most three neighbours: we
22  # unroll the for loop of CanonicalAnchors.
23  [child1, child2, child3] = Neighbours(G, op)
24  # Iterate over all ways to split w-1 anchors over
25  # the three children and solve recursively
26  all_anchors = []
27  for 0 <= w1, w2, w3 < w with w1 + w2 + w3 == w - 1:
28    for (anchors1, seen1, G1) in
29        AllAnchors(G, child1, w1, seen0):
30      for (anchors2, seen2, G2) in
31          AllAnchors(G1, child2, w2, seen1):
32        for (anchors3, seen3, G3) in
33            AllAnchors(G2, child3, w3, seen2):
34          # Concatenate new anchor with anchors from all paths
35          anchors = union([op], anchors1, anchors2, anchors3)
36          all_anchors.push((anchors, seen3, G3))
37  return all_anchors

We can represent the sequence of recursive calls to AllAnchors as a tree. The call tree for the graph used as example to illustrate CanonicalAnchors earlier is given on the next page.

We now show correctness of the procedure. Let us write Πrw(G)\Pi_r^w(G) for the set of sets of anchors returned by AllAnchors(G, r, w, {}).

Proposition 4.11Correctness of AllAnchors

Let GG be a graph and HGH \subseteq G be a subgraph of GG of width ww. Let rr be a choice of root operation in HH. We have πr(H)Πrw(G).\pi_r(H) \subseteq \Pi_r^w(G).

A call tree for an execution of AllAnchors on the example graph of the previous figure with w=3w = 3w=3. Starting from the root, each node in the tree corresponds to either picking an operation as anchor or not (thus splitting it). Edges are labelled by the values assigned to www for the respective children of the source node. One path from root to leaf leads to no solution (it is impossible to find an unseen linear path from operation bbb. The other paths each lead to a valid set of three anchors.

A call tree for an execution of AllAnchors on the example graph of the previous figure with w=3w = 3. Starting from the root, each node in the tree corresponds to either picking an operation as anchor or not (thus splitting it). Edges are labelled by the values assigned to ww for the respective children of the source node. One path from root to leaf leads to no solution (it is impossible to find an unseen linear path from operation bb. The other paths each lead to a valid set of three anchors.

The proof is by induction over the width ww of the subgraph HH. The idea is to map every recursive call in CanonicalAnchors to one of the calls to AllAnchors on lines 29, 31 or 33. All recursive results are concatenated on line 36, and thus, the value returned by CanonicalAnchors will be one of the anchor sets in the list returned by AllAnchors.

Proof

Let HGH \subseteq G be a connected subgraph of GG of width ww. We prove inductively over ww that if (X,S,H)=(X, S', H') = CanonicalAnchors$(H, r,S)$ then there is a graph GG' such that HGGH' \subseteq G' \subseteq G such that

(X,S,G)(X, S', G') \in AllAnchors(G,r,w,S)(G, r, w, S)

for all valid root operations rr of HH and all subsets of the linear paths of HH in seen_paths. The statement in the proposition directly follows this claim.

For the base case w=1w = 1, CanonicalAnchors will return the anchors anchors = [op] as defined on line 19: there is only one linear path, and it is already in seen_paths, thus for every recursive call to CanonicalAnchors, the while condition on line 12 will always be satisfied until all operations have been exhausted and empty sets are returned. In AllAnchors, on the other hand, The only values of w1, w2 and w3 that satisfy the loop condition on line 27 for w=1w = 1 are w1 == w2 == w3 =0= 0. As a result, given the w =0=0 base case on lines 6–7, the lines 35 and 36 of AllAnchors are only executed once, and the definition of anchors on line 36 is equivalent to its definition in CanonicalAnchors.

We now prove the claim for w>1w > 1 by induction. As documented in AllAnchors, we can assume that every operation has at most 3 children. This simplifies the loop on lines 21–25 of CanonicalAnchors to, at most, three calls to CanonicalAnchors.

Consider a call to CanonicalAnchors for a graph HGH \subseteq G, a root operation rr in HH and a set SS of linear paths. Let waw_a, wbw_b and wcw_c be the length of the values returned by the three recursive calls to CanonicalAnchors of line 22 for the execution of CanonicalAnchors with arguments HH, rr and SS. Let ca,cbc_a, c_b and ccc_c be the three neighbours of rr in HH. If the child cxc_x does not exist, then one can set wx=0w_x = 0 and it can be ignored – the argument below still holds in that case. The definition of seen0 on line 20 in AllAnchors coincides with the update to the variable seen_paths on line 18 of CanonicalAnchors; similarly, the updates to G on lines 14 and 18 of AllAnchors are identical to the lines 11 and 15 of CanonicalAnchors that update H. Let the updated seen_paths be the set SaS_a, the updated G be GaG_a and the updated HH be HaH_a, with HaGaH_a \subseteq G_a.

As every anchor operation reduces the number of unseen linear paths by exactly one (using the simplifying assumptions of section 4.2), it must hold that wa+wb+wc+1=ww_a + w_b + w_c + 1 = w. Thus, for a call to AllAnchors with the arguments GG, rr, ww and SS, there is an iteration of the for loop on line 27 of AllAnchors such that w1 =wa= w_a, w2 =wb= w_b and w3 =wc= w_c. It follows that on line 29 of AllAnchors, the procedure is called recursively with arguments (Ga,ca,wa,Sa)(G_a, c_a, w_a, S_a). From the induction hypothesis, we obtain that there is an iteration of the for loop on line 29 in which the values of anchors1 and seen1 coincide with the values of the new_anchors and seen_paths variables after the first iteration of the for loop on line 21 of CanonicalAnchors. Call the value of seen1 (and seen_paths) SbS_b. Similarly, call the updated value of G in AllAnchors GbG_b and the updated value of G in CanonicalAnchors HbH_b. We have, by the induction hypothesis, that HbGbH_b \subseteq G_b.

Repeating the argument, we obtain that there are iterations of the for loops on lines 30 and 32 of AllAnchors that correspond to the second and third recursive calls to CanonicalAnchors on line 22 of the procedure. Finally, the concatenation of anchor lists on line 36 of AllAnchors is equivalent to the repeated concatenations on line 25 of CanonicalAnchors, and so we conclude that the induction hypothesis holds for ww.

We will see that the overall runtime complexity of AllAnchors can be easily derived from a bound on the size of the returned list. For this, we use the following result:

Proposition 4.10Number of anchor sets in AllAnchors
For a graph GG, a root operation rr in GG and 1wwidth(G)1 \leq w \leq width(G), the length of the list AllAnchors(G,r,w)(G, r, w) is in O(cww3/2)O(c^w \cdot w^{-3/2}), where c=6.75c = 6.75 is a constant.
Proof

Let CwC_w be an upper bound for the length of the list returned by a call to AllAnchors for width ww. For the base case w=0w = 0, C0=1C_0 = 1. The returned all_anchors list is obtained by pushing anchor lists one by one on line 36. We can count the number of times this line is executed by multiplying the length of the lists returned by the recursive calls on lines 28–32, giving us the recursion relation

Cw0w1,w2,w3<ww1+w2+w3=w1Cw1Cw2Cw3.C_w \leq \sum_{\substack{0 \leq w_1, w_2, w_3 < w\\w_1 + w_2 + w_3 = w - 1}} C_{w_1} \cdot C_{w_2} \cdot C_{w_3}.

Since CwC_w is meant to be an upper bound, we replace \leq with equality above to obtain a recurrence relation for CwC_w. This recurrence relation is a generalisation of the well-known Catalan numbers Stanley, 2015Richard P. Stanley. 2015. Catalan Numbers. Cambridge University Press. doi: 10.1017/CBO9781139871495, equivalent to counting the number of ternary trees with ww internal nodes: a ternary tree with w1w \geq 1 internal nodes is made of a root along with three subtrees with w1,w2w_1,w_2 and w3w_3 internal nodes respectively, with w1+w2+w3=w1w_1 + w_2 + w_3 = w-1. A closed form solution to this problem can be found in Aval, 2008Jean-Christophe Aval. 2008. Multivariate Fuss–Catalan numbers. Discrete Mathematics 308, 20 (October 2008, 4660–4669). doi: 10.1016/j.disc.2007.08.100​:

Cw=(3ww)2w+1=Θ(cww3/2)C_w = \frac{{3w \choose w}}{2w + 1} = \Theta \left(\frac{c^w}{w^{3/2}} \right)

satisfying the above recurrence relation with equality, where c=27/4=6.75c = 27/4 = 6.75 is a constant obtained from the Stirling approximation:

(3ww)=(3w)!(2w)!w!=Θ(1w)((3w)3e3)w(e2(2w)2)w(ew)w=Θ((27/4)ww1/2).\begin{aligned}{3w \choose w} = \frac{(3w)!}{(2w)!w!} &= \Theta\left(\frac{1}{\sqrt{w}}\right) \Big(\frac{(3w)^3}{e^3}\Big)^{w}\Big(\frac{e^2}{(2w)^2}\Big)^{w}\Big(\frac{e}{w}\Big)^{w}\\ &= \Theta\left(\frac{(27/4)^w}{w^{1/2}}\right).\end{aligned}

To obtain a runtime bound for AllAnchors, it is useful to identify how much of GG needs to be traversed. If we suppose all patterns have at most depth dd, then it immediately follows that any operation in GG that is in the image of a pattern embedding must be at most a distance dd away from an anchor operation. We can thus equivalently call AllAnchors on a subgraph of GG such that no linear path is longer than 2d2d. We thus obtain the following runtime.

Proposition 4.12Runtime of AllAnchors

For patterns with at most width ww and depth dd, the total runtime of AllAnchors is in

O(cwdw1/2).O\left(\frac{c^w \cdot d}{w^{1/2}}\right).

Proof

We restrict Operations on line 9 to only return the first dd operations on the linear path in each direction, starting at the anchor operation: operations more than distance dd away from the anchor cannot be part of a pattern of depth dd.

We use the bound on the length of the list returned by calls to AllAnchors of Proposition 4.10 to bound the runtime. We can ignore the non-constant runtime of the concatenation of the outputs of recursive calls on line 35, as the total size of the outputs is asymptotically at worst of the same complexity as the runtime of the recursive calls themselves. Excluding the recursive calls, the only remaining lines of AllAnchors that are not executed in constant time are the while loop on lines 15–18 and the Operations and sort calls on lines 9–11. Using the same argument as in CanonicalAnchors, we can ignore the latter two calls by replacing the queue of operations by a lazy iterator of operations. The next operation given op and the graph G can always be computed in O(1)O(1) time using a depth-first traversal of G.

Consider the recursion tree of AllAnchors, i.e. the tree in which the nodes are the recursive calls to AllAnchors and the children are the executions spawned by the nested for loops on line 28–32. This tree has at most

Cw=Θ(cww3/2)C_w = \Theta\left(\frac{c^w}{w^{3/2}}\right)

leaves. A path from the root to a leaf corresponds to a stack of recursive calls to AllAnchors. Along this recursion path, seen_paths set is always strictly growing (line 35) and the operations removed from G on lines 14 and 18 are all distinct. For each linear path, at most 2d2d operations are traversed. Thus the total runtime of the while loop (lines 15–18) along a path from root to leaf in the recursion tree is in O(wd)O(w \cdot d). We can thus bound the overall complexity of executing the entire recursion tree by O(Cwwd)=O(cwdw1/2)O(C_w \cdot w \cdot d) = O(\frac{c^w \cdot d}{w^{1/2}}).


  1. Every anchor operation is on at least one previously unseen linear path, thus there can be at most ww operations in the set of anchors. ↩︎