An automaton for multi-pattern matching

4.6. An automaton for multi-pattern matching

We have shown in the previous sections that graph pattern-matching can be reduced to a problem of tree inclusions, with trees of fixed width ww. To complete the pattern-matching algorithm, we must provide a fast way to evaluate the subtree relation for many trees representing the set of all patterns we wish to match.

More precisely, for patterns P1,,PP_1, \dots, P_\ell with width ww, fix a root operation rir_i in PiP_i for each 1i1 \leqslant i \leqslant \ell and consider the rooted tree duals of the canonical PSGs τri(Piπi)\tau_{r_i}(P_i^{\pi_i}), with πi=πri(Pi)\pi_i = \pi_{r_i}(P_i) the canonical anchors. Then given a subject graph GG, we wish to compute the set

{1iτri(Piπi)τr(Gπ)},\{1 \leqslant i \leqslant \ell \mid \tau_{r_i}(P_i^{\pi_i}) \subseteq \tau_r(G^\pi)\},

for all anchor sets πΠrw(G)\pi \in \Pi_r^w(G) and root operation rr in GG. This corresponds to the IsSubTree predicate introduced in the sketch of the algorith in section 4.4.

Instead of considering the trees of PSGs, it will prove easier to consider the contracted PSGs (cPSGs)

τri(c(Piπi))andτr(c(Gπ)).\tau_{r_i}(c(P_i^{\pi_i}))\quad\textrm{and}\quad \tau_r(c(G^\pi)).

Such tree inclusions are equivalent to finding embeddings in the subject graph itself, provided that we keep track of the splitsplit and contractcontract weight maps (see section 4.3).

It will be useful to remind ourselves the following properties of contracted PSGs. Every operation of a cPSG (and thus every node in its rooted dual tree) is an anchor operation of the PSG. Per Proposition 4.4, the rooted dual tree of a cPSG is a ternary tree and has exactly width(G)1width(G) - 1 nodes. Finally, recall the concept of an open value of a graph, i.e. a value that is missing either a use or define operation (see section 4.2).

Reduction of tree inclusion to string prefix matching #

Now consider two contracted spanning tree reductions c(G1π1)c(G_1^{\pi_1}) and c(G2π2)c(G_2^{\pi_2}) with values V1V_1 and V2V_2. To simplify notation, define

τ1=τr1(c(G1π1))andτ2=τr2(c(G2π2))\tau_1 = \tau_{r_1}(c(G_1^{\pi_1})) \quad\textrm{and}\quad \tau_2 = \tau_{r_2}(c(G_2^{\pi_2}))

for some choice of root operations r1r_1 and r2r_2 in G1G_1 and G2G_2, respectively. We lift the \subseteq relation on rooted dual trees of PSGs introduced in section 4.5 to rooted dual trees of cPSGs in Such a way that there is an inclusion relation between two rooted dual trees of PSGs if and only if the same relation holds on the rooted duals of cPSGs.

We say that τ1τ2\tau_1 \subseteq \tau_2 if and only if

  • the trees share the same root operation,
  • τ1\tau_1 is a subtree of τ2\tau_2,
  • the spiltspilt map coincides on the common subtree, and
  • the contractcontract map satisfies for all vV1v \in V_1:
    {contract(v)contract(f(v))if v is an open value,contract(v)=contract(f(v))otherwise,\begin{cases}contract(v) \subseteq contract(f(v))\quad&\textrm{if }v\textrm{ is an open value},\\contract(v) = contract(f(v))\quad&\textrm{otherwise},\\\end{cases}

where f:V1V2f: V_1 \hookrightarrow V_2 designates the embedding of V1V_1 into V2V_2 given by the tree embedding.

The first three conditions are taken as-is from the \subseteq relation on non-contracted trees, whilst the fourth condition on the contractcontract map is specific to contracted trees.

Using Proposition 4.2, there are at most 2 open values for each linear path in the graph, and thus at most 2w2 \cdot w open values in a rooted dual tree of a cPSG of width ww. For each such contracted rooted dual, we can thus define a contracted string tuple S=(s1,,s2w)(O)2wS = (s_1, \dots, s_{2w}) \in (O^\ast)^{2w} given by the values of the contractcontract map evaluated in the (up to) 2w2w open values1.

If contractCcontract|_C is the restriction of contractcontract to the domain of definition of non-open values of a cPSG, the fourth condition for the inclusion relation \subseteq on rooted dual cPSGs, given above becomes an equality condition when restricted to non-open values. A special case of this property of particular interest to us is stated as the following result. The \subseteq relation on strings refers to prefix inclusion, i.e. sts \subseteq t if and only if ss is a prefix of tt.

Proposition 4.14Inclusion of equal-width trees

Let S=(s1,,s2w)S = (s_1, \dots, s_{2w}) and T=(t1,,t2w)(O)2wT = (t_1, \dots, t_{2w}) \in (O^\ast)^{2w} be the contracted string tuples of τ1\tau_1 and τ2\tau_2 respectively. Then τ1τ2\tau_1 \subseteq \tau_2 if and only if the trees share the same root, are isomorphic, have the same splitsplit and contractCcontract|_C maps and for all i{1,,2w}i \in \{1, \dots, 2w\}: sitis_i \subseteq t_i.

The proof of this follows directly from observing that rooted duals of cPSGs have the same set of nodes and that the restriction to non-open values contractCcontract|_C must satisfy equality.

Why restricting ourselves to trees of the same width ww? It is sufficient for our purposes! All patterns are of width ww by assumption and so are the rooted dual trees of the form τr(Gπ)\tau_r(G^\pi), given that πΠrw(G)\pi \in \Pi_r^w(G).

The string prefix matching problem is a simple computational task that can be generalised to check for multiple string patterns at the same time using a prefix tree. An overview of this problem can be found in appendix A. We can thus obtain a solution for the pattern matching problem for \ell patterns:

Proposition 4.15Fixed anchor pattern matching

As above, let

  • GG be a graph, with πΠrw(G)\pi \in \Pi_r^w(G) a set of w1w - 1 operations and rπr \in \pi a choice of root operation,
  • P1,,PP_1, \dots, P_\ell be patterns of width ww and depth dd, with choices of root operations r1,,rr_1, \dots, r_\ell and canonical anchors πi=πri(Pi).\pi_i = \pi_{r_i}(P_i).

The set of all pattern embeddings mapping the canonical anchor set πi\pi_i to π\pi and root rir_i to rr for 1i1 \leq i \leq \ell can be computed in time O(wd)O(w\cdot d) using at most \ell pre-computed prefix tree of size at most (d)w+1(\ell \cdot d)^w + 1, each constructed in time complexity O((d)w)O((\ell \cdot d)^w).

Proof

For each pattern, we consider its canonical spanning tree reduction and construct a multi-dimensional prefix tree (see Appendix ) for each group of patterns that share the same spanning tree reduction.

Given a graph GG, we can compute the cPSG of GG for anchors π\pi and map its rooted dual tree to the corresponding prefix tree. This can be done in O(TG)O(|T_G|) time by using a search tree. We can restrict GπG^\pi to a graph of size O(wd)O(w \cdot d) by truncating the linear paths to at most 2d2d length, as in the proof of Proposition 4.12. Thus we can assume GπO(wd)|G^\pi| \in O(w \cdot d).

The rest of the proof and the runtime follow from the multi-dimensional prefix tree construction detailed in Appendix ).

Combining everything #

Finally, putting Proposition . and Proposition 4.12 together, we obtain our main result.

Proposition 4.13Pattern matching

Let P1,,PP_1, \dots, P_\ell be patterns with width ww and depth dd. The pre-computation runs in time and space complexity

O((d)w+wd).O \left( (d\cdot \ell)^w \cdot \ell + \ell \cdot w \cdot d \right).

For any subject graph GG, the pre-computed prefix tree can be used to find all pattern embeddings PiGP_i \to G in time

O(Gcww1/2d)O \left( |G| \cdot \frac{c^w}{w^{1/2}} \cdot d \right)

where c=6.75c = 6.75 is a constant.

Proof

The pre-computation consists of running the CanonicalAnchors procedure on each of the \ell patterns and then transforming them into a map of prefix trees using Proposition .. By Proposition 4.7, CanonicalAnchors runs in O(wd)O(w\cdot d) for each pattern, where we used that Piwd|P_i| \leqslant w \cdot d for all patterns. The total runtime of prefix construction is thus

O((d)w+wd).O \left( (d\cdot \ell)^w \cdot \ell + \ell \cdot w \cdot d \right).

The complexity of pattern matching itself on the other hand is composed of two parts: the computation of all possible anchor sets Πrw(G)\Pi_r^w(G), and the execution of the prefix string matcher for each of the trees resulting from these sets πΠrw(G)\pi \in \Pi_r^w(G). As AllAnchors must be run for every choice of root vertex rr in GG, the runtime is thus obtained by multiplying i) G|G| with ii) the runtime of the prefix tree matching (Proposition .), and with iii) Πrw(G)|\Pi_r^w(G)|, i.e. the number of anchor lists returned by AllAnchors (Proposition 4.10):

O(GwdCw),O(|G| \cdot w \cdot d \cdot C_w ),

where CwC_w is the bound for the number of anchor lists returned by AllAnchors. The result follows.


  1. The values can be ordered as usual by using the total lexicographic order on port labels of the tree. ↩︎