Benchmarks

4.7. Benchmarks

Proposition 4.13 shows that pattern-independent matching can scale to large datasets of patterns but imposes some restrictions on the patterns and embeddings that can be matched. In this section, we discuss these limitations and give empirical evidence that the pattern-matching approach we have presented can be used on a large scale and outperform existing solutions.

Pattern limitations #

In section 4.2, we imposed conditions on the pattern embeddings to obtain a complexity bound for pattern-independent matching. We argued how these restrictions are natural for applications in quantum computing, and most of the arguments will also hold for a much broader class of computation graphs.

In future work, it would nonetheless be of theoretical interest to explore the importance of these assumptions and their impact on the complexity of the problem. As a first step towards a generalisation, our implementation and all our benchmarks in this section do not make any of these simplifying assumptions. Our results below give empirical evidence that a significant performance advantage can be obtained regardless.

Implementation #

We provide an open-source implementation in Rust of pattern independent matching using the results of this chapter, available on GitHub. The code and datasets used for the benchmarks themselves are available in a dedicated repository.

The implementation works for weighted or unweighted port graphs – of which typed minIR graphs are a special case – and makes none of the simplifying assumptions employed in the theoretical analysis. Pattern matching proceeds in two phases: precomputation and runtime.

Precomputation.  In a first step, all graph patterns are processed and compiled into a single state automaton that will be used at runtime for fast pattern independent matching. The automaton in the implementation combines in one data structure two distinct computations of this chapter:

  • the recursive branching logic used in the AllAnchors procedure to enumerate all possible choices of anchors.
  • the automaton described in section 4.6 that matches patterns for a fixed set of anchors, and

The former is implemented with non-deterministic state transitions – each transition corresponding to choosing an additional anchor – , whereas the latter is implemented deterministically.

Concretely, the automaton is constructed by following the construction of section 4.4 to decompose each pattern into its canonical path-split graph. We then order the nodes of the PSG and express each node as a condition that ensures the connectivity and node weight in the graph matches the pattern. We thus obtain a chain of conditions, with a transition between any two consecutive conditions; transitions are deterministic by default and marked as non-deterministic whenever they lead to a condition on an anchor node. The state automaton for all patterns is then obtained by joining all chains of conditions into a tree.

Runtime.  Pattern matching is then as simple as simulating the state automaton, evaluating all conditions on the graph GG passed as input. The states in the automaton corresponding to the last condition of a pattern must be marked as end states, along with a label identifying the pattern that was matched. This can then be used at runtime to report all patterns found.

Our implementation has been tested for correctness, i.e. on the one hand that all matches that are reported are correct, and on the one hand that all pattern matches are found. This was done by comparing the matches of our implementation with the results obtained from matching every pattern separately on millions of randomly generated graphs and edge cases. We also ensured during benchmarking that the number of matches reported by our implementation and by Quartz were always the same.

Benchmarks #

Baseline.  To assess practical use, we have benchmarked our implementation against a leading C++ implementation of pattern matching for quantum circuits from the Quartz superoptimiser project Xu, 2022Mingkuan Xu, Zikun Li, Oded Padon, Sina Lin, Jessica Pointing, Auguste Hirth, Henry Ma, Jens Palsberg, Alex Aiken, Umut A. Acar and Zhihao Jia. 2022. Quartz: Superoptimization of Quantum Circuits. In Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation, June 2022. Association for Computing Machinery, 625--640. doi: 10.1145/3519939.3523433. This implementation is the principal component of an end-to-end quantum circuit optimisation pipeline. The results and speedups we obtain here thus apply and transfer directly to this application.

Dataset.  We further ensure that our results apply in practice by using a real-world dataset of patterns. The Quartz optimiser finds opportunities for circuit optimisation by relying on precomputed equivalence classes of circuits (ECC). These are obtained exhaustively by enumerating all possible small quantum circuits, computing their unitaries and clustering them into classes of circuits with identical unitaries.

The generation of ECC sets is parametrised on the number of qubits, the maximum number of gates and the gate set in use. For these benchmarks we chose the minimal set of gates T,H,CXT, H, CX and considered circuits with up to 6 gates and 2, 3 or 4 qubits. The size of these pattern circuits is typical for the application1.

Thus, for our patterns, we have the bound d6d \leq 6 for the maximum depth and width w=2,3,4w = 2,3,4. In all experiments, the graph GG subject to pattern matching was barenco_tof_10 input, i.e. a 19-qubit circuit input with 674 gates obtained by decomposing a 10-qubit Toffoli gate using the Barenco decomposition Barenco, 1995Adriano Barenco, Charles H. Bennett, Richard Cleve, David P. DiVincenzo, Norman Margolus, Peter Shor, Tycho Sleator, John A. Smolin and Harald Weinfurter. 1995. Elementary gates for quantum computation. Physical Review A 52, 5 (November 1995, 3457--3467). doi: 10.1103/PhysRevA.52.3457.

Results.  We study the runtime of our implementation as a function of the number \ell of patterns being matched, up to =104\ell = 10^4 patterns. We expect the runtime of pattern matching algorithms that match one pattern at a time to scale linearly with \ell. On the other hand, Proposition 4.13 results in a complexity that is independent of \ell.

For each value of \ell, we select a subset of all patterns in the ECC sets at random. For w=2w = 2, there are only a total of =1954\ell = 1954 patterns, explaining why we do not report result beyond that number. For =200\ell = 200 patterns, our proposed algorithm is 3×3\times faster than Quartz. As expected, the advantage of our approach increases as we match more patterns, scaling up to a 20×20\times speedup for =104\ell=10^4. The results are summarised in the following figure:

Runtime of pattern matching for ℓ=0…104\ell = 0\dots 10^4ℓ=0…104 patterns on 2, 3 and 4 qubit quantum circuits from the Quartz ECC dataset, for our implementation (Portmatching) and the Quartz project. All ℓ=1954\ell = 1954ℓ=1954 two-qubit circuits were used, whereas for 3 and 4 qubit circuits, ℓ=104\ell = 10^4ℓ=104 random samples were drawn.

Runtime of pattern matching for =0104\ell = 0\dots 10^4 patterns on 2, 3 and 4 qubit quantum circuits from the Quartz ECC dataset, for our implementation (Portmatching) and the Quartz project. All =1954\ell = 1954 two-qubit circuits were used, whereas for 3 and 4 qubit circuits, =104\ell = 10^4 random samples were drawn.

Dependency on ww and \ell.  We further study the runtime of our algorithm as a function of its two main parameters, the number of patterns \ell and the pattern width ww, on an expanded dataset. To this end, we generate random sets of 10,000 pattern circuits with 15 gates and between w=2w=2 and w=10w=10 qubits, using the same gate set as previously. The resulting pattern matching runtimes are shown in the figure below.

From Proposition 4.13, we expect that the pattern matching runtime is upper bounded by a \ell-independent constant. Our results support this result for w=2w=2 and w=3w=3 qubit patterns, where runtime seems indeed to saturate, reaching an observable runtime plateau at large \ell.

We suspect on the other hand that the exponential cwc^w dependency in the complexity bound of Proposition 4.13 makes it difficult to observe a similar plateau for w4w \geq 4, as we expect this upper bound on the runtime to increase rapidly with qubit counts . A runtime ceiling is not directly observable at this experiment size, but the gradual decrease in the slope of the curve is consistent with the existence of the \ell-independent upper bound predicted in Proposition 4.13.

Runtime of our pattern matching for random quantum circuits with up to 10 qubits.

Runtime of our pattern matching for random quantum circuits with up to 10 qubits.


  1. Such small circuit sizes are imposed in part by the fact that ECCs of larger circuits quickly become unfeasible to generate as their number grows exponentially. In practice, large circuit transformations can often be expressed as the composition of smaller atomic transformations, hence the good performance of this approach in practice. ↩︎