XL ZL Sequential Graph-Based Decoding of the Surface Code using a Hybrid Graph and Recurrent Neural Network Model Master’s thesis in Complex Adaptive Systems OLE FJELDSÅ GUSTAF JONASSON JOHANSSON DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2025 www.chalmers.se www.chalmers.se Master’s thesis 2025 Sequential Graph-Based Decoding of the Surface Code using a Hybrid Graph and Recurrent Neural Network Model OLE FJELDSÅ, GUSTAF JONASSON JOHANSSON Department of Physics Chalmers University of Technology Gothenburg, Sweden 2025 Sequential Graph-Based Decoding of the Surface Code using a Hybrid Graph and Recurrent Neural Network Model OLE FJELDSÅ GUSTAF JONASSON JOHANSSON © OLE FJELDSÅ, GUSTAF JONASSON JOHANSSON, 2025. Supervisor: Mats Granath, Department of Physics, University of Gothenburg. Examiner: Mats Granath, Department of Physics, University of Gothenburg. Master’s Thesis 2025 Department of Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Example syndrome with accompanying graphs of size dT = 2 over three surface code cycles. Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2025 iv Sequential Graph-Based Decoding of the Surface Code using a Hybrid Graph and Recurrent Neural Network Model OLE FJELDSÅ GUSTAF JONASSON JOHANSSON Department of Physics Chalmers University of Technology Abstract In order to achieve reliable quantum computation with noisy qubits, quantum error correction (QEC) is necessary. Quantum error correcting codes mitigate the inher- ent noise in quantum systems by distributing the logical state over several qubits, thereby introducing redundancy. One such promising code is the surface code. It encodes the logical qubit using a two-dimensional lattice of physical data qubits and ancilla qubits. By taking and decoding measurements on the ancilla qubits of the surface code, one can deduce whether a logical bit- or phase-flip has occurred. However, this is a complex and potentially time-consuming task. Multiple decoding algorithms exist, such as the classical minimum-weight perfect matching (MWPM) decoder. In recent years, data-driven algorithms have been shown to decode the surface code with a high degree of accuracy. In this thesis, we present a machine learning approach to decoding the surface code using a combination of graph neural networks (GNN) and recurrent neural networks (RNN). Specifically, graph represen- tations are constructed over a short, sliding time window of syndrome measurement data. Each representation is processed by a GNN and its output is used as a learned high-dimensional embedding for an RNN. This enables continuous decoding of mea- surement patterns over longer time series. While the decoder is trained on relatively short syndromes, it is able to generalize for unseen syndromes and longer time se- ries, outperforming the classical MWPM algorithm across both short and long time series. This work opens up a new approach to reliable and potentially fast decoding of QEC codes. Keywords: Quantum error correction, graph neural networks, recurrent neural net- works, gated recurrent unit, surface code. v Acknowledgements We are grateful to Mats Granath and Moritz Lange for their invaluable guidance, insightful discussions, and patience throughout this project. We also thank Blaž Pridgar for his technical help and contributions to our model analysis. Finally, we appreciate Tobias Wallström and Eduardo Manoni for their spirited banter. Computations were enabled by resources provided by the National Academic Infras- tructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725. Ole Fjeldså and Gustaf Jonasson Johansson, Gothenburg, June 2025 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: CPU Central Processing Unit FPGA Field Programmable Gate Array GMP Global Mean Pool GNN Graph Neural Network GPU Graphics Processing Unit GRU Gated Recurrent Unit LSTM Long Short-Term Memory MWPM Minimum Weight Perfect Matching RNN Recurrent Neural Network ix Contents List of Acronyms ix List of Figures xiii 1 Introduction 1 2 Theory 3 2.1 Graph theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Quantum computing . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2.1 Qubit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2.2 Pauli operators . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2.3 Quantum circuits . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.4 Quantum gates . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.5 Entanglement . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Quantum error codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Bit-flip repetition code . . . . . . . . . . . . . . . . . . . . . . 8 2.3.2 The stabilizer formalism . . . . . . . . . . . . . . . . . . . . . 9 2.3.3 The surface code . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.4 Minimum-weight perfect matching . . . . . . . . . . . . . . . 13 2.4 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4.1 Feed forward, fully connected neural networks . . . . . . . . . 14 2.4.2 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . 15 2.4.3 Gated recurrent unit . . . . . . . . . . . . . . . . . . . . . . . 16 2.4.4 Graph convolutional networks . . . . . . . . . . . . . . . . . . 17 2.4.5 Global mean pool . . . . . . . . . . . . . . . . . . . . . . . . . 18 3 Methods 19 3.1 Data generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Neural network architecture . . . . . . . . . . . . . . . . . . . . . . . 21 3.3 Training setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4 Results 26 4.1 Decoder training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 4.2 Decoder performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 5 Conclusions 30 xi Contents References 33 A Appendix I A.1 Appendix 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Appendix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I xii List of Figures 2.1 The Bloch sphere. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Entanglement of two qubits. . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 The two qubit encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4 The surface code four-cycle. . . . . . . . . . . . . . . . . . . . . . . . 10 2.5 Surface code from four-cycles. . . . . . . . . . . . . . . . . . . . . . . 11 2.6 The simplified surface code schematic. . . . . . . . . . . . . . . . . . 12 2.7 Logical XL, YL and ZL operators on the surface code. . . . . . . . . . 12 2.8 MWPM example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.9 MWPM example on the surface code. . . . . . . . . . . . . . . . . . . 14 2.10 Deep neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.11 Fundamental structure for a recurrent neural network. . . . . . . . . 16 2.12 The gated recurrent unit. . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.13 Graph convolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Sliding window schema. . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.2 Time required to generate syndrome graphs based on surface codes. . 21 3.3 Overview of the entire model pipeline. . . . . . . . . . . . . . . . . . 23 3.4 Proportion of class 0 syndromes as a function of syndrome length. . . 25 4.1 Accuracy and model output as a function of training epoch. . . . . . 27 4.2 Accuracy and failure rate as a function of code cycles. . . . . . . . . . 28 4.3 Accuracy and failure rate as a function of code cycles. . . . . . . . . . 28 4.4 Accuracy and failure rate as a function of error rate. . . . . . . . . . . 29 xiii List of Figures xiv 1 Introduction Quantum computing has the potential to transform how we solve complex prob- lems in fields such as drug design [1], chemistry [2], and financial market analysis [3]. Perhaps its most well-known application is prime factorization [4], [5]. Unlike classical computers, which process information using bits that exist in one of two dis- tinct states (0 or 1), quantum computers use qubits that can exist in superpositions of both states simultaneously. This unique property, combined with entanglement and other quantum phenomena, enables quantum algorithms to outperform classical methods in certain tasks. However, qubits are highly fragile, with their quantum state easily disrupted by environmental noise or imperfect control signals. Even the slightest interference can cause decoherence, collapsing the superposition and erasing valuable information. To counteract this, quantum error correction (QEC) is essential for preserving coherence and enabling reliable quantum computation. To allow for error correction in a classical computer we can use a simple repetition code. For example, a simple three bit encoding {0, 1} → {000, 111} can correct a single bit-flip error and detect a two bit-flip error. Unfortunately, such an error correction scheme is not possible for qubits due to the no-cloning theorem [6], the presence of both bit-flips and phase-flips, and the wave function collapse resulting from reading the qubit. To perform error correction on qubits, stabilizer codes are used. A stabilizer code is a type of quantum error correcting code that protects quantum information by encoding so-called logical qubits as a larger set of physical qubits by entanglement. It is defined by a set of stabilizer operators, which are elements of the Pauli group that commute with each other and stabilize the encoded logical states. Errors are detected by measuring these stabilizers without collapsing the quantum state of the logical qubit. The surface code [7], [8] is one such stabilizer code. It is a topological error correcting code that protects quantum information by encoding a logical qubit in a two-dimensional lattice of physical qubits. Efficiently decoding the surface code is a challenging problem. One widely used approach is the Minimum Weight Perfect Matching (MWPM) decoder [8]–[16], a classical method that performs well in general but struggles with correlated X/Z error rates and imperfect measurements. Maximum likelihood decoders [8], [17]– [20], on the other hand, achieve excellent accuracy but are computationally infea- sible due to their exponential time complexity. A promising alternative is machine learning-based decoders, a number of which have been proposed [21]–[34]. Machine 1 1. Introduction learning models have the potential to decode rapidly once trained and are adaptable to different noise models, offering both flexibility and efficiency. This work builds on previous work from from the Department of Physics at Uni- versity of Gothenburg [35] using graph neural networks (GNN) to classify the most likely error class of a graph of stabilizer measurements. We build on that by break- ing down graphs of longer syndromes into overlapping sub-graphs. These graphs are then processed by a combination of a GNN and a recurrent neural network (RNN) to generalize the decoding over variable time frames, in a similar fashion to [22]. We show that the combined GNN-RNN approach outperforms MWPM over long time series for simulated stabilizers under surface level noise. The decoder is able to generalize over several thousand measurement cycles while only being trained on syndromes of 49-99 stabilizer cycles. 2 2 Theory This chapter provides the necessary theoretical background for this thesis. It cov- ers the basics of graph theory, quantum computing, quantum error correction and machine learning with neural networks. 2.1 Graph theory A directed weighted graph is described by the tuple G = (N,E,w) where N is a set of nodes, E ⊆ {(ni, nj) | ni, nj ∈ N2, i ̸= j} is a set of directed edges between nodes and w is a function E → T mapping edges to a weight of type T . Similarly, an undirected weighted graph has a set of undirected edges E ⊆ {(ni, nj) | ni, nj ∈ N, i ̸= j} between nodes. An undirected weighted graph can be obtained from a directed weighted graph by ensuring that each edge is bidirectional and equally weighted in both directions i.e. if (ni, nj) ∈ E then (nj, ni) ∈ E and w(ni, nj) = w(nj, ni). Graphs serve as versatile models for relational structures, where nodes represent entities and edges denote relationships. The number of nodes in a graph |N | is called the order of the graph and the number of edges in a graph |E| is called the size of the graph. There are several different ways of deciding the weights of an edge. If for example the nodes are coordinates (x, y), the weights between them could be the distance between the coordinates. In this example the coordinates (x, y) would be referred to as the node features of the graph. 2.2 Quantum computing This section provides the necessary background in quantum computing to build the concepts of quantum error correction. 2.2.1 Qubit The qubit is the quantum computer equivalent to the bit in a classical computer. It represents the basic unit of information in quantum computing. The qubit |ψ⟩ is written as |ψ⟩ = α |0⟩ + β |1⟩ = ( α β ) where |0⟩ = ( 1 0 ) , |1⟩ = ( 0 1 ) are orthonormal basis vectors, and α, β are complex 3 2. Theory values with |α|2 + |β|2 = 1. Here α, β give us the probabilities of measuring a qubit as |0⟩ or |1⟩. For example, the probability of measuring a zero is equal to |α|2. X Y Z | |+ |- |i |-i |0 |1 Figure 2.1: The Bloch sphere. Any state corresponds to a point on the sphere, the positive and negative z direction represent the basis states |0⟩ and |1⟩. The position of the state is defined by (2.1). Figure 2.1 shows a geometric interpretation of the qubit called the Bloch sphere [36] with the qubit state represented by a point on the sphere as |ψ⟩ = cos θ2 |0⟩ + eiφ sin θ2 |1⟩ (2.1) In addition to the basis states we have the notable states |+⟩, |−⟩, |i⟩ and |−i⟩ on the positive and negative x and y axis of the sphere. All of these states have an equal probability of being measured as |0⟩ or |1⟩ but have different phases. Multiple qubits are written using the tensor product ⊗ |ψ⟩1 ⊗ |ψ⟩2 = (α1 |0⟩ + β1 |1⟩) ⊗ (α2 |0⟩ + β2 |1⟩) = α1α2 |00⟩ + α1β2 |01⟩ + β1α2 |10⟩ + β1β2 |11⟩ (2.2) For the two qubit case this can be written as a state vector of length four α |00⟩ + β |01⟩ + γ |10⟩ + δ |11⟩ =  α β γ δ  (2.3) 4 2. Theory 2.2.2 Pauli operators The Pauli operators are 1 = ( 1 0 0 1 ) , X = ( 0 1 1 0 ) , Y = ( 0 −i i 0 ) , Z = ( 1 0 0 −1 ) (2.4) Each of the operators X, Y, and Z have the effect of flipping the qubit π radians around the x, y, or z or axis on the Bloch Sphere. As we can see from figure 2.1, the X operator corresponds to a bit-flip error and the Z operator corresponds to a phase-flip error. In the context of quantum error correction, the Y operator can be seen as a combination of the X and Z operator as it corresponds to Y = iXZ. Applying the X or Z operator to a qubit |ψ⟩ yields X |ψ⟩ = αX |0⟩ + βX |1⟩ = α |1⟩ + β |0⟩ (2.5a) Z |ψ⟩ = αZ |0⟩ + βZ |1⟩ = α |0⟩ − β |1⟩ (2.5b) The Pauli group on a single qubit is defined as the Pauli operators G1 = {±1,±i1,±X,±iX,±Y,±iY,±Z,±iZ}, (2.6) with the ±1,±i terms to ensure that G1 is closed under multiplication. Every mem- ber of G1 has eigenvalues ±1 or ±i, is unitary, hermitian and self-inverse. Addi- tionally, any two members of the Pauli group satisfy the commutation and anti- commutation relations [σj, σk] = σjσk − σkσj = 2iεjklσl (2.7a) {σj, σk} = σjσk + σkσj = 2δjk1 (2.7b) where εjkl is the Levi-Civita symbol and δjk is the Kronecker delta εjkl =  1 if (j, k, l) ∈ {(x, y, z), (y, z, x), (z, x, y)} (even permutation) −1 if (j, k, l) ∈ {(x, z, y), (y, x, z), (z, y, x)} (odd permutation) 0 otherwise δjk = { 1 if j = k 0 otherwise The general Pauli group is made up of all operators formed from tensor products of elements from G1. If we for example have a three qubit system we can apply the operation 1 ⊗X ⊗ Y = X2Y3, where the left hand side is the tensor product of three operations (applied to their respective qubit), and the right hand side is the support notation of the same oper- ation. 5 2. Theory 2.2.3 Quantum circuits Qubits and operators are visualized using quantum circuit diagrams. Each qubit is represented by a horizontal line going from left to right through time. The different operations applied to the qubits are placed along these lines. Single qubit operations like H, X, and Z are represented as a single letter in a box placed on the line as in (2.8). Multiple qubit gates are represented using vertical connections between qubits as in (2.9) or rectangles covering multiple lines as in figure 2.3. Single lines represent qubits while double lines represent classical information, the measurement of a qubit is represented by the meter symbol . 2.2.4 Quantum gates Besides the Pauli operators, some additional quantum operators are needed, namely the Hadamard gate H and the controlled X and Z gates CNOT and CZ. The Hadamard gate is defined as H = 1√ 2 ( 1 1 1 −1 ) |ψ⟩ H (2.8) It has the effect H |0⟩ = |+⟩ , H |1⟩ = |−⟩, and vice versa. In this way the Hadamard gate can be used to bring a qubit in and out of superposition. The CNOT and CZ gates are defined as CNOT =  1 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0  control target (2.9) CZ =  1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 −1  control target (2.10) These gates work as conditional X and Z gates. If the control is equal to |1⟩, apply X or Z to the target qubit. If the control and target are in superposition the CNOT and CZ gates have the effects CNOT (α |00⟩ + β |01⟩ + δ |10⟩ + γ |11⟩) = α |00⟩ + β |01⟩ + γ |10⟩ + δ |11⟩ (2.11) CZ(α |00⟩ + β |01⟩ + δ |10⟩ + γ |11⟩) = α |00⟩ + β |01⟩ + δ |10⟩ − γ |11⟩ (2.12) One important phenomenon enabled by controlled qubits is phase kickback, which refers to controlled operations having effects on both the target and control qubit [37]. This is essential to QEC and exemplified in section 2.3.1. 6 2. Theory 2.2.5 Entanglement One of the distinguishing features of quantum systems is the ability to entangle. In lieu of duplicating quantum information as a way of introducing redundancy, as that is not possible due to the no-cloning theorem [38], entanglement is utilized. Using entanglement, it is possible to encode information with multi-particle superpositions [37]. Figure 2.2 shows how one can entangle two qubits. As we can see from the figure, measuring one of the qubits at the end of the circuit collapses the state of the other qubit into a corresponding state due to entanglement. |ψ⟩ |0⟩  α 0 β 0   α 0 0 β  Figure 2.2: Entanglement of two qubits. The state vectors before and after the CNOT gate denote the states α |00⟩ + β |10⟩ and α |00⟩ + β |11⟩ respectively. 2.3 Quantum error codes In order to facilitate reliable QEC, one needs to introduce redundancy in the way a qubit is represented. In practice, this means distributing the logical state over multiple qubits. If only one qubit is used, it is not possible to deduce if its value is the result of some computation, or an error. Using multiple qubits increases the reliability of the system by taking advantage of the fact that an error happening in multiple qubits is less likely than an error happening in just one qubit [38]. This section covers the basics of QEC. 7 2. Theory 2.3.1 Bit-flip repetition code |ψ⟩1 E Z1 |0⟩2 Z2 |0⟩A H H |ψ⟩L E |ψ⟩L Figure 2.3: The two qubit encoder. The two qubits |ψ⟩1 and |0⟩2 are entangled using the CNOT gate as seen in figure 2.2 creating the logical qubit |ψ⟩L. An error from table 2.1 is applied to |ψ⟩L. The ancilla qubit in the |+⟩A state is used to perform the projective Z1Z2 measurement on |ψ⟩L. To allow for redundancy in a quantum code we can not simply copy bits [6]. Instead, we entangle the state of the qubits, expanding their Hilbert space. This spreads the state of a qubit |ψ⟩ over the logical state |ψ⟩L. Here we show the two qubit encoder as seen in figure 2.3 to exemplify this. |ψ⟩ two qubit encoder−−−−−−−−−−→ |ψ⟩L = α |00⟩ + β |11⟩ = α |0⟩L + β |1⟩L . This has the effect of changing the Hilbert space of the qubit. Before encoding, the qubit is parametrized within |ψ⟩ ∈ H2 = span{|0⟩ , |1⟩}. After the encoding this changes to the Hilbert space |ψ⟩L ∈ H4 = span{|00⟩ , |01⟩ , |10⟩ , |11⟩} that we can split up into two mutually orthogonal subspaces C = span{|00⟩ , |11⟩}, F = span{|01⟩ , |10⟩}. In the two qubit system we have the following bit-flip errors: Error 1112 X112 11X2 X1X2 Syndrome 0 1 1 0 Table 2.1: Syndrome table for the two qubit code. If we now apply the bit flip X1 to |ψ⟩L we get X1 |ψ⟩L = α |10⟩ + β |01⟩ ∈ F ⊂ H4 In addition to the logical qubit, the two qubit encoder utilizes a third qubit called the ancilla qubit. The ancilla qubit is initialized in the |0⟩A state before applying the Hadamard gate to it, changing it to the state |+⟩A. At this point the ancilla is 8 2. Theory prepared and ready to be used in the projective measurement Z1Z2 using two CZ gates. The Z1Z2 operator applied to |ψ⟩L yields the eigenvalue (+1) Z1Z2 |ψ⟩L = Z1Z2(α |00⟩ + β |11⟩) = (+1) |ψ⟩L (2.13) The Z1Z2 operator is said to stabilize |ψ⟩L as it leaves it unchanged. If we instead apply the Z1Z2 operator to X1 |ψ⟩L we end up with the eigenvalue (−1) Z1Z2X1 |ψ⟩L = Z1Z2(αX1 |00⟩ + βX1 |11⟩) = (−1)X1 |ψ⟩L (2.14) This would also be the case for the bit-flip error X2 |ψ⟩L. Since the ancilla qubit is used as a control for the Z1Z2 operator we get the state 1√ 2 ((1112)(E |ψ⟩L) |0⟩A + Z1Z2(E |ψ⟩L) |1⟩A) (2.15) After applying the second Hadamard gate to the ancilla qubit we end up in the state 1 2((1112 + Z1Z2)(E |ψ⟩L) |0⟩A + (1112 − Z1Z2)(E |ψ⟩L) |1⟩A) (2.16) If E |ψ⟩L ∈ F we get the eigenvalue Z1Z2E |ψ⟩L = (−1)E |ψ⟩L and the first term goes to zero. If instead E |ψ⟩L ∈ C then Z1Z2E |ψ⟩L = (+1)E |ψ⟩L and the second term goes to zero. Thus, measuring the ancilla qubit in the state |1⟩A informs us that E ∈ {X112,11X2} and measuring the ancilla in the state |0⟩A informs us that E ∈ {1112, X1X2}. 2.3.2 The stabilizer formalism The stabilizer formalism describes the formal requirements of a general [[n, k, d]] stabilizer code [39]. Here n is the total number of data qubits, k is the number of logical qubits and d is the stabilizer code distance. The code distance of a quantum error correction code is the minimum number of physical qubits that can go unde- tected and cause a logical error. In the case of the two qubit encoder we can detect but not correct a single bit-flip meaning that if the qubit was only susceptible to bit-flip errors it would be of distance d = 2. As qubits are susceptible to phase-flip errors as well, we need a stabilizer code to take the logical Z operator into account when determining the code distance. The stabilizers S of an [[n, k, d]] code is a subgroup of the n-qubit Pauli group Gn. Gn is made up of all operators of length n formed from tensor products of elements from G1(2.6). As with G1, the members of Gn either commute or anti-commute. All stabilizers must stabilize all logical states of the code space C. They must also com- mute with each other. The requirement for commutation is intersecting on an even number (including zero) of data qubits. This allows the stabilizers to be measured simultaneously [38]. Generally, any code that adheres to the stabilizer formalism can be used to, at most, detect an error of length d− 1, and correct an error of length d−1 2 . 9 2. Theory 2.3.3 The surface code A code that can detect both bit-flips and phase-flips is the so-called surface code. Originally defined on a torus [7], [8], it can instead be created using a two-dimensional lattice of qubits with two different types of boundaries [40]. The surface code four- cycle is the building block of the surface code. Figure 2.4 shows the quantum circuit diagram of the four-cycle and a simplified representation of the quantum circuit. D1 D2A2 A1 |D1⟩ |D2⟩ |0⟩A1 H H |0⟩A2 H H Figure 2.4: Surface code four-cycle. The left hand side diagram shows a simplified representation of the quantum circuit diagram on the right. Pink represents X stabilizers while blue represents Z stabilizers. The data qubits are represented by circles and the ancilla qubits by squares. While X and Z anti-commute (XZ = −ZX) the stabilizers XD1XD2 and ZD1ZD2 commute as they act on two shared qubits. XD1XD2ZD1ZD2 = XD1ZD1XD2ZD2 = −ZD1XD1XD2ZD2 = (−1)2ZD1XD1ZD2XD2 = ZD1ZD2XD1XD2 (2.17) Using the four-cycle we can build the surface code, the left hand side of figure 2.5 shows the [[13, 1, 3]] planar surface code. As we can see from the diagram, each stabilizer commutes as they still intersect non-trivially on an even number of qubits. From here we can reduce the cost of a logical qubit by rotating the planar surface code as seen on the right hand side of figure 2.5. This reduces the amount of physical qubits needed to encode the logical qubit while keeping the code distance d = 3. As we can see from the diagram, the required commutation relations still hold. 10 2. Theory D1 A1 D2 A2 D3 A3 D4 A4 D5 A5 D6 A6 D7 A7 D8 A8 D9 A9 D10 A10 D11 A11 D12 A12 D13 D1 A1 D2 A2 D3 A3 D4 A4 D5 A5 D6 A6 D7 A7 D8 A8 D9 A9 D10 A10 D11 A11 D12 A12 D13 Figure 2.5: Left: a tiling of the surface code four-cycle 2.4 creat- ing the planar [[13, 1, 3]] surface code. Right: the [[9, 1, 3]] rotated surface code made by rotating the four-cycle tiling. After constructing the rotated surface code from the surface code four cycle, the schematic is further simplified to the representation shown in figure 2.6, this is the representation that will be used in the rest of this thesis. Additionally, the rotated surface code will simply be referred to as the surface code. 11 2. Theory ZL XL Y X Z Figure 2.6: The simplified schematic for the [[25, 1, 5]] surface code. Data qubits are represented by the open circles, while sta- bilizers are represented by the squares and ”lobes” of the code. Dark gray colouration represents X stabilizers while light gray rep- resents Z stabilizers. Pink represents activated X stabilizers while blue represents activated Z stabilizers. Errors are represented by the letter on their respective data qubits. The logical X and Z operators are along the left and top border. On the surface code, the logical X and Z operators must still satisfy the anti- commutation relationship XLZL = −ZLXL. Additionally, XL and ZL must com- mute with all the stabilizers (preserve the eigenvalue of the ancilla qubits). This is achieved by letting XL and ZL span the surface code along its left and upper border. Furthermore, YL is defined as a combination of XL and ZL by replacing the top left corner data qubit with the Y operator and placing X and Z operators along the rest of their respective borders. Figure 2.7 shows an overview of the logical operators on the surface code. x x x x x z z z z z x x x x Y z z z z Figure 2.7: The logical X (left), Z (middle), and Y (right) opera- tors on the surface code. The logical error classes are the sets of errors having the same effect on the state of the logical qubit as a logical operator. As such, error chains with an odd parity of X operators along the upper border belong to the X coset, and error chains with 12 2. Theory an odd parity of Z operators along the left border belong to the Z coset. As with the logical X/Z operators, error chains with an odd parity along both the left and upper border belong to the Y coset [41]. Error chains with an even parity along the logical operators instead belong to the I coset. Figure 2.6 shows a syndrome belonging to the logical Y coset. As the stabilizers of a quantum error correction code can only show the parity of the data qubits surrounding them, and not the error chain itself, the purpose of a quantum error correction decoder is to decide the most likely chain of errors and based on the error chain deduce to which logical coset the syndrome belongs. 2.3.4 Minimum-weight perfect matching A matching of a graph G is a set of edges such that no edge shares a node with another edge. A perfect matching includes every node in G. In order for the perfect matching to be minimum-weight, it must minimize the sum of the edge weights. Figure 2.8 shows an example of the minimum-weight perfect matching (MWPM) of a graph. (0, 0) (4, 3)(1, 3) (3, 0) √ 10 3 3 √ 10 (0, 0) (4, 3)(1, 3) (3, 0) 3 3 MWPM Figure 2.8: Creating the MWPM of a graph. Here the nodes are 2D Cartesian coordinates and the edge weights are the distances between them. The MWPM of a graph can be found using the blossom algorithm [9] developed by Jack Edmonds in the 1960s, and it turns out this algorithm can be used to decode quantum stabilizer codes. In QEC literature it is known as the MWPM decoder [42]. Figure 2.9 shows an example of a matching using MWPM on the surface code. Note the bright coloured ”virtual stabilizers” on the boundary of the surface code. If an odd number of X or Z stabilizers are activated, the virtual boundary stabilizers facilitate the creation of a matching using MWPM, since an even number of nodes is required for this. Additionally, the virtual stabilizers allow for matchings to the edge of the surface code even when an even number of stabilizers are activated. As seen in figure 2.9, all virtual stabilizers do not need to be part of the matching, they are only used as required. After the matching of the surface code is created, it is used to deduce the most likely error chains. 13 2. Theory ZL XL z z z z z z z z z Figure 2.9: Left: surface code with a syndrome from the logical coset Z. Middle: the same syndrome with the fully connected graph between all X stabilizers and all X virtual boundary checks. Right: a matching using MWPM of the syndrome. The light blue and pink plaquettes outside the surface code represent the additional virtual boundary checks. 2.4 Neural networks This project is based on developing a QEC decoder using a data-driven approach. In particular, our decoder is based on neural networks. As such, this section covers the basics of neural networks as it relates to this work. 2.4.1 Feed forward, fully connected neural networks x0 x1 ... xnx ... . . . ... y′ 0 y′ 1 ... y′ ny Figure 2.10: Deep neural network with input X = {x0, x1, . . . , xnx}, a sequence of hidden layers, and an output Y′ = {y′ 0, y ′ 1, . . . , y ′ ny}. Each layer Xl is updated according to (2.20) with the input of each layer being the output of the previous. 14 2. Theory The building block of neural networks is the (artificial) neuron, first proposed as a mathematical model for the nervous system [43]. The value of a neuron N can be written as N(X) = f b+ ∑ xj∈X xjwj  (2.18) where X is the input vector to the neuron, wj is the weight from input xj to the neuron, b is the bias for the neuron, and f is a (typically non-linear) activation function. From this we can create a layer of neurons with output Xl taking the previous layer Xl−1 as input Xl(Xl−1) =  N0 N1 ... Nn  , where Ni(Xl−1) = f bi + ∑ xj∈Xl−1 xjwji  (2.19) This can be rewritten as the matrix operation Xl(Xl−1) = f (WlXl−1 + Bl) (2.20) where Wl is a n × k matrix, Xl−1 is a k × d matrix, Bl is a n × d matrix, n is the number of neurons in the layer and d is the number of samples in the input tensor (batch dimension). These layers can be stacked after each other creating a deep neural network as seen in figure 2.10. The neural network is initialized with random weights and biases. To allow the neural network to make precise predictions it is trained on known input output pairs {X,Y} and every weight and bias is adjusted using backpropagation gradi- ent descent [44]. In this process every parameter p (weights and biases) is updated according to p′ = −η ∂E(Y,Y′) ∂p . Here, E(Y,Y′) is a chosen error, or loss, function in- dicating the performance of the network and η is a learning rate ensuring appropriate size of the parameter update. 2.4.2 Recurrent neural networks To allow a neural network to handle time series data of variable length it is useful to give it some sort of ”memory”. The recurrent neural network (RNN) [45] allows for this by connecting the hidden state (sequence of hidden layers) h of the network at time t to the hidden state at time t+ 1 as seen in figure 2.11. 15 2. Theory yt h xt Wxh Why Whh Unroll yt−1 yt yt+1 ht−1 ht ht+1 xt−1 xt xt+1 . . . . . . Wxh Wxh Wxh Why Why Why Whh Whh Whh Whh Figure 2.11: Fundamental structure for a recurrent neural net- work. The hidden layer(s) feed their state to the output and to the output for the current time step. The figure shows both the unrolled and rolled up schematic at time t. The hidden state is similar to the feed forward network and the parameters are updated using backpropagation. 2.4.3 Gated recurrent unit ht−1 ht xt ht rt = σ zt = σ nt = tanh ⊙ ⊙ ⊙ + 1− Figure 2.12: Circuit diagram of a single GRU block. rt, zt, nt and ht are calculated according to (2.21). Using fully connected layers for the hidden state of the RNN can run into the vanishing gradient problem for longer time series [46], [47], which makes network training difficult. One approach to mitigating this is to use long short-term memory (LSTM) [48] blocks for the hidden state. LSTM circumvents the vanishing gradient by keeping separate long term and short term hidden states. The gated recurrent unit (GRU) [49] builds on the LSTM but removes the cell state vector, combining 16 2. Theory long and short term memory, giving it the parameters rt = σ(Wxrxt + bxr +Whrh(t−1) + bhr) (2.21a) zt = σ(Wxzxt + bxz +Whzh(t−1) + bhz) (2.21b) nt = tanh(Wxnxt + bxn + rt ⊙ (Whnh(t−1) + bhn)) (2.21c) ht = (1 − zt) ⊙ nt + zt ⊙ h(t−1) (2.21d) σ(x) = 1 1 + exp(−x) Here the reset gate rt decides how much of the previous state to remember. If rt = 0 the new state forgets the entirety of the old state. The update gate zt decides how much of the past hidden state to keep. If zt = 1 we take none of the new state and all of the old state and vice versa. The candidate state nt gives a candidate for a new hidden state before the update gate decides how much of it to accept. Figure 2.12 gives a more illustrative view of the GRU block. 2.4.4 Graph convolutional networks x1 x5 x2 x3 x4 e2,1 e5,1 x′ 1 = f(b1 + W1x1 + W2(e5,1x5 + e2,1x2)) Figure 2.13: The update of a single node feature x′ 1, according to (2.22) during graph convolution. Graph convolutional networks are a type of neural networks that are designed to operate on graphs. Graph convolution layers map every node feature x1, x2, . . . to a new state according to the graph neural network operator [50] x′ i = f bi + W1xi + W2 ∑ j∈Ni ejixj  (2.22) Like other neural networks, the weights and biases (Wi, bi) are trainable parameters initialized randomly and updated using backpropagation. eji are the edge weights of the graph and f is the activation function. The graph neural network is permutation equivariant, meaning that it does not change the structure of the graph - only the node features. Figure 2.13 shows the update rule acting on a single node feature during a graph convolution. 17 2. Theory 2.4.5 Global mean pool The global mean pool (GMP) of a graph G is the mean of every node feature across the graph nodes NG according to GMP(G) = 1 |NG| ∑ x∈NG x (2.23) where |NG| is the number of nodes in the graph. This reduces (pools) the state of the graph into a single vector representation of the same size as a single node feature vector. The fixed size of the pool makes it a suitable embedding between a graph and neural network structures demanding fixed size inputs. 18 3 Methods This chapter covers the methods that were employed when creating our quantum error correction decoder. First, we explain how data for model training and inference was generated. Next, we describe the neural network architecture. Finally, we cover how the neural network was trained. 3.1 Data generation In order to generate data for use during training and inference, the Python package Stim [51] was used. Stim can be used to simulate quantum stabilizer circuits, such as the surface code in our case. When simulating stabilizer circuits, bit-flip and phase-flip errors are applied with error rate p using a noise model called circuit-level noise. This noise model consists of depolarizing noise that applies X, Y , and Z errors with equal probability pX = pY = pZ = p 3 , in addition to bit-flip errors after ancilla qubit measurements and resets, and two-qubit gate errors. After simulating the circuit for T cycles, or time steps, Stim returns a vector of so-called detection events from T + 1 time steps, and a boolean value indicating if the simulation has resulted in a logical bit-flip or phase-flip. This value is used as the label when training and evaluating the neural network. Stim simulates a real experiment and as such, the data qubits at the end of the simulation can only be measured in either the X basis or the Z basis. If the data qubits are measured in the X basis, logical phase-flips can be detected, and if they are measured in the Z basis, logical bit-flips can be detected. It is this final measurement of the data qubits that causes the +1 in the T +1 expression. The detection events are also boolean values, indicating whether each stabilizer measurement differs from its previous state, i.e. activated if there is an odd parity on the surrounding data qubits, not activated otherwise. Having generated a syndrome, a series of graphs each spanning dT time steps are created. The nodes correspond to detection events, each represented by a feature vector [x, y, tr, Z,X] where x, y, tr are node coordinates and X,Z are boolean val- ues indicating stabilizer type. The x and y coordinates indicate the location of the stabilizer on the surface code, x, y ∈ [0, d], the top left being the origin. The tr coordinate indicates the relative time step of the detection event within the graph, tr ∈ [0, dT − 1]. 19 3. Methods Each node has an edge to its twenty nearest neighbours. These edges are created using a k-nearest neighbour algorithm. Generally, the graphs contain less than twenty nodes, and as such are undirected and fully connected. In principle, though, the graphs can contain more than twenty nodes, and in these cases the graphs are not fully connected, and potentially not undirected. The number of nodes in a graph has a positive correlation with the code distance d, the number of time steps the graph spans dT , and the error rate p. Next, the edge weights are created. These are calculated as the square of the inverse supremum norm between two nodes. For instance, the edge weight between nodes i and j, eij, is: eij = (max(|xi − xj|, |yi − yj|, |tri − trj |))−2 (3.1) The graphs are created according to a sliding window schema, such that all nodes (except for nodes in the first and last time step) are present in multiple overlapping graphs, see Figure 3.1. Should a graph be empty, such as if there are no detection events until t = 5 in Figure 3.1, leading to an empty Graph 1, the next graph, i.e. Graph 2, takes its place as the first graph of the syndrome. Therefore, syndromes contain a varying number of graphs for a given syndrome length T . t = 0 t = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 ... Graph 1 Graph 2 Graph 3 Graph 4 Graph 5 Figure 3.1: Diagram showing how graphs are created according to a sliding window schema. The circles represent all nodes that are present in the corresponding time step t. Since a given node can belong to multiple graphs, something has to be done to avoid creating a single graph that spans the entire syndrome. The solution we settled on was making copies of each node. Each node is copied min(t, dT − 1) times, where t corresponds to the time step the node belongs to. Nodes belonging to time step t = 0 are only present in one graph, and therefore do not need to be copied. Nodes belonging to time step t = 1 are present in two graphs, and as such need to be copied once. Next, the time coordinate tr of each node copy is adjusted accordingly. For instance, nodes in time step t = 4 belonging to Graph 1 in Figure 3.1 have time coordinate tr = 4, and the ”same” nodes in Graph 2 instead have time coordinate tr = 3. Each node is also assigned an index indicating to which graph it belongs. This index is used when computing the edges. Unfortunately, creating the syndrome graphs by first copying the nodes takes a sig- nificant amount of time, as the number of nodes is increased by roughly a factor of dT . During training of our neural network, around 30%-50% of the time is dedicated to data generation. Section 3.3 covers this in greater detail. Further, during infer- ence, the vast majority of the time is dedicated to data generation, as the model 20 3. Methods is evaluated on syndromes of much longer length than those on which the model is trained, and this takes a long time. Figure 3.2 shows how the time required to generate syndrome graphs varies with syndrome length T and error rate p for code distances d = 3, 5, 7. 0 200 400 600 800 1000 Number of surface code cycles 0 1 2 3 4 5 6 7 Ti m e (m s) d = 3 d = 5 d = 7 0.001 0.002 0.003 0.004 0.005 Error rate 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 Ti m e (m s) d = 3 d = 5 d = 7 Figure 3.2: Time required to generate syndrome graphs based on surface codes of distances d = 3, 5, 7, with dT = 2. The left panel shows how the time to generate syndrome graphs changes as a function of syndrome length, with fixed error rate p = 0.001. The right panel instead shows how the time to generate syndrome graphs changes with error rate p, and fixed syndrome length T = 99. The error bars represent one standard deviation. The results are based on 5 · 103 randomly generated syndromes per data point. 3.2 Neural network architecture The neural network is trained to detect either logical bit-flips or phase-flips. First, the syndrome graphs are fed through a GNN whose purpose is to create a high dimensional feature vector of each node. The GNN consists of a series of graph convolutions that successively increase the size of the feature vectors. The recti- fied linear unit (ReLU) activation function, see (3.2), is used following each graph convolution. ReLU(x) = max(0, x) (3.2) Next, graph embeddings are obtained by averaging the node features of each graph using an operation called global mean pool, see section 2.4.5. The graph embeddings are then fed sequentially through an RNN, which in our case is a multi-layer GRU. At each time step, the GRU outputs several hidden states, one for each layer. These, in combination with the next graph embedding, are used as input to the GRU in the following time step. Finally, the hidden state corresponding to the last layer of the final time step is decoded by feeding it through a dense layer, which maps the hidden state vector to 21 3. Methods a scalar. A sigmoid function is then applied to limit the scalar to the interval (0, 1). In order to determine if a logical bit-flip or phase-flip has occurred, this number is rounded to the nearest integer. A zero indicates that no error has occurred, whereas a one indicates that an error has indeed occurred. It is important to note that a given network can only detect one type or error. If, in the last measurement round, the data qubits are measured in the X basis, the network learns to detect logical phase-flips, and if the data qubits are measured in the Z basis, the network learns to detect logical bit-flips. Therefore, one needs two separate networks to deduce if the syndrome belongs to the logical X , Y , Z, or I coset. We exclusively trained networks to detect logical bit-flips, but it is trivial to train networks to detect logi- cal phase-flips, as the only difference is which data qubits are measured in the last round. Figure 3.3 shows an overview of the model pipeline. 22 3. Methods ZL XL ... ... Graph Graph Graph Graph Graph Embedding Graph Embedding Graph Embedding Graph Embedding GRU GRU GRU GRU Dense out h4 T +1  x y tr Z X   x y tr Z X  Graph Convolutions G M P GRU layer 1: GRU layer 2: GRU layer 3: GRU layer 4: xt−1 xt xt+1 GRU block GRU block GRU block GRU block GRU block GRU block GRU block GRU block GRU block GRU block GRU block GRU block h4 t−1 h4 t h4 t+1 . . . . . . . . . . . . . . . . . . . . . . . . C A B Figure 3.3: Overview of the entire model pipeline. A: Transfer surface codes to graph over all detector events. B: Embedding from graph to embedding tensor using graph convolutions. C: four layer Gated Recurrent Unit. A benefit of using an RNN is that, during inference, the whole syndrome does not need to be fed through the neural network at once, only the part that corresponds to the last graph and the previous hidden state. In other words, decoding a syndrome at time step t = 7999 theoretically takes the same amount of time as decoding a syndrome at t = 99. This is not the case for other decoders, such as MWPM. Our decoder was implemented using PyTorch [52] and PyTorch Geometric (PyG) [53]. PyTorch is a machine learning library that enables efficient tensor computing. PyG is built upon PyTorch and provides functionality to implement graph neural networks. 23 3. Methods 3.3 Training setup Since our decoder is based on a neural network, a significant amount of data is needed to train it. As mentioned in section 3.1, data was obtained by generating it using Stim. Instead of generating fixed training and test sets, data was generated one batch at a time during training. As such, every batch that is fed through the network is unique. This approach has both benefits and drawbacks. First, it de- creases the tendency for the network to overfit on the provided data. On the other hand, a significant amount of time during training is dedicated to data generation. Further, because Stim runs on the CPU, every new batch of graphs that is generated must be transferred to the GPU, which slows down training further. Generating the data and transferring it to the GPU takes roughly 30%-50% of the total training time. Despite not using a fixed training set, the training was split into so-called epochs. Generally, one epoch corresponds to feeding the whole data set through the neural network once. We settled on 256 batches counting as one epoch, where each batch consists of 2048 syndromes. Therefore, one epoch worth of data corresponds to roughly 5 · 105 syndromes. The number of epochs required for the loss to converge is positively correlated with code distance. Chapter 4 covers this in greater detail. It is important to note that we did not make use of a test set as a way of evaluating the performance of the model and the convergence of the loss during training. The reason for this is that with no fixed training set, the model does not overfit on the provided data, as each batch is randomly sampled from a state space much larger than the total number of syndromes generated during training. During training, the network learns to predict a label that is either 0 or 1. The probability of a logical bit-flip or phase-flip error occurring is positively correlated with syndrome length and error rate p. As such, the relative class proportion for short syndromes is skewed towards class 0, i.e. the dataset is imbalanced. This can cause issues when training machine learning models [54]. Figure 3.4 shows the class 0 proportion as a function of syndrome length for code distances 3, 5, and 7. We can see that for shorter syndrome lengths, the dataset is indeed imbalanced. However, the class 0 proportion converges to 0.5 as the syndrome length increases. The models we trained were all based on syndromes with roughly equal class proportions. 24 3. Methods 20 40 60 80 100 Number of surface code cycles 0.45 0.50 0.55 0.60 0.65 0.70 0.75 0.80 0.85 Cl as s 0 p ro po rti on d = 3 d = 5 d = 7 Figure 3.4: Proportion of class 0 syndromes as a function of syndrome length for code distances d = 3, 5, 7, with error rate p = 0.003. The error bars represent one standard deviation. The results are based on roughly 2.5·105 randomly generated syndromes per data point. It is important to note that syndromes containing no detection events are always excluded, both during training and inference. Since nodes in the syndrome graphs are based on detection events, the absence of them implies that all the graphs be- longing to a syndrome are empty. Syndromes without any detection events belong to class 0, and are referred to as trivial cases. These cases are also excluded from model performance metrics, such as accuracy. Our decoder was trained using the Adam (Adaptive Moment Estimation) optimizer [55], with betas β1 = 0.9, β2 = 0.999. Adam is similar to ordinary stochastic gradient descent, but it leverages momentum and adaptive learning rates for each parameter. Momentum means that part of the previous update is added onto the next. This helps speed up convergence. The decoder was optimized by minimizing the binary cross-entropy (BCE) loss function: BCE = − 1 N N∑ i=1 [yilog(Oi) + (1 − yi)log(1 −Oi)] (3.3) where N is the number of observations, yi ∈ {0, 1} is the target label, and Oi ∈ (0, 1) is the output from the network. The initial learning rate, η0, was set to 10−3. During training, the learning rate was decreased exponentially to 10−4 according to ηt = max(10−4, η00.95t), where t corresponds to the current epoch. 25 4 Results With a method for obtaining data and a neural network architecture in place, mod- els can be trained and evaluated. In this chapter we first cover the training of the models before presenting their performance. 4.1 Decoder training Three models were trained, one for each distance d ∈ {3, 5, 7}. All three models were trained with a varying error rate p ∈ {0.001, 0.002, 0.003, 0.004, 0.005} that was uniformly sampled on a batch-by-batch basis. Because the number of epochs required for the loss to converge is positively correlated with code distance, the dis- tance 7 model was trained for 500 epochs, the distance 5 model for 300 epochs, and the distance 3 model for 150 epochs. Since each epoch consists of roughly 5 · 105 syndromes, the distance 7 model was trained on 2.5 · 108 syndromes, the distance 5 model on 1.5 · 108 syndromes, and the distance 3 model on 7.5 · 107 syndromes. The three models were trained with slightly different syndrome graph settings. Both the distance 5 and 7 model were trained on syndromes of length T = 49, with sliding window size dT = 2. The distance 3 model was instead trained on syndromes of length T = 99 with sliding window size dT = 5. There are two reasons for this. First, the time required to train models increases with code distance, syndrome length, and sliding window size. The distance 7 model took roughly four days and eighteen hours to train on an Nvidia A40 GPU. Had it instead been trained on syndromes of length T = 99 with dT = 5, it would have taken more than a week. Therefore, in the interest of time, we settled on shorter syndromes. Secondly, we experimented with multiple different syndrome lengths and sliding window sizes, especially on distance 3 models, as they can be trained relatively quickly. We found that models that were trained on syndromes of length T = 49 had similar performance to those trained using T = 99. A few distance 3 models were trained on syndromes of length T = 24, too. While these models performed relatively well on shorter syndromes, they failed to generalize for longer syndromes. The state space grows exponentially with code distance. As such, the amount of information that the neural network needs to encode also grows exponentially. Ini- tially, the models had the same number of parameters regardless of code distance, roughly 5.3 · 105, with a graph embedding vector of length 256, and the hidden 26 4. Results state of the GRU a vector of length 128. This configuration worked well for code distances 3 and 5, but it was not enough for code distance 7. Therefore, the sizes of the graph embedding vector and hidden state vector for the code distance 7 model were doubled to 512 and 256, respectively, increasing the number of parameters to roughly 2.1 · 106. See appendix A.2 for a more detailed breakdown of the number of parameters and architecture of each model. The results from training are summarized in figure 4.1. The left panel shows the accuracy as a function of training epoch. As expected, the distance 3 model con- verges at a lower accuracy compared to the two other models. The right panel shows the mean model output for class 0 and class 1 before rounding, also as a function of training epoch. The model output can be interpreted as the confidence of the models. Based on this interpretation, the distance 3 model is less confident with respect to which input belongs to which class. 0 100 200 300 400 500 Epoch 0.6 0.7 0.8 0.9 Ac cu ra cy d = 3 d = 5 d = 7 0 100 200 300 400 500 Epoch 0.2 0.4 0.6 0.8 M od el o ut pu t d = 3 d = 5 d = 7 Class 1 Class 0 Figure 4.1: Accuracy (left) and model output (right) as a function of training epoch. Each epoch consists of roughly 5 · 105 randomly generated syndromes. 4.2 Decoder performance Having trained the models, it is time to evaluate their performance. The MWPM decoder implemented by the PyMatching library [56] is used as a benchmark for all performance metrics, and is represented by dashed lines in figures 4.2 through 4.4. Figure 4.2 shows the accuracy and failure rate of the three models as a function of surface code cycle. The distance 3 and 5 models both outperform MWPM over the entire domain. However, the distance 7 model has similar or worse performance compared to MWPM. 27 4. Results 0 200 400 600 800 1000 Number of surface code cycles 0.825 0.850 0.875 0.900 0.925 0.950 0.975 1.000 Ac cu ra cy d = 3 d = 5 d = 7 Our model MWPM 0 200 400 600 800 1000 Number of surface code cycles 10 4 10 3 10 2 10 1 Fa ilu re ra te d = 3 d = 5 d = 7 Our model MWPM Figure 4.2: Accuracy (left) and failure rate (right) as a function of surface code cycles, with error rate p = 0.001. The vertical bars indicate standard deviation. All models were tested with 5 · 105 syndromes per data point. Figure 4.3 shows similar data compared to figure 4.2, but here the x-axis is loga- rithmic and the models are evaluated on syndromes up to 104 surface code cycles. As seen in figure 4.3, the performance of the distance 3 model drastically decreases, approaching an accuracy of 0.5. This is equivalent to randomly guessing when the dataset is balanced and there are two classes, as in this case. However, this is not due to poor performance of the decoders, it is an intrinsic limitation of the life-time of the qubit. The distance 5 model continues to confidently outperform MWPM whereas the distance 7 model is noticeably worse than MWPM. While all three models are trained on syndromes shorter than a hundred surface code cycles, the figures suggest that they are able to generalize for much longer time series. The distance 5 model, especially, performs well relative to MWPM for long time series. 101 102 103 104 Number of surface code cycles 0.5 0.6 0.7 0.8 0.9 1.0 Ac cu ra cy d = 3 d = 5 d = 7 Our model MWPM 101 102 103 104 Number of surface code cycles 10 5 10 4 10 3 10 2 10 1 Fa ilu re ra te d = 3 d = 5 d = 7 Our model MWPM Figure 4.3: Accuracy (left) and failure rate (right) as a function of surface code cycles up to 104 cycles, with error rate p = 0.001. The vertical bars indicate standard deviation. All models are tested with 6 · 104 syndromes per data point. Next, the performance of the models is evaluated with respect to error rate p. Figure 4.4 shows model performance for a fixed syndrome length T = 99 as a function of 28 4. Results error rate. The distance 3 and 5 models both outperform MWPM for all error rates p ∈ {0.001, 0.002, 0.003, 0.004, 0.005}. The distance 7 model, on the other hand, has similar performance as compared to MWPM for p = 0.001, but its relative performance decreases as the error rate increases. 0.001 0.002 0.003 0.004 0.005 Error rate 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Ac cu ra cy d = 3 d = 5 d = 7 Our model MWPM 0.001 0.002 0.003 0.004 0.005 Error rate 10 3 10 2 10 1 Fa ilu re ra te d = 3 d = 5 d = 7 Our model MWPM Figure 4.4: Accuracy (left) and failure rate (right) as a function of error rate. The vertical bars indicate standard deviation. All models are tested with syndrome length T = 99 with 5 · 105 syn- dromes per data point. Finally, the time required to decode syndromes is evaluated. While our model is slower than MWPM on any fixed length syndrome decoding, the use of an RNN allows the decoder to process the syndrome at every time step. This means that, instead of accumulating syndrome measurements and decoding the whole syndrome at once, the decoder is able to process information one cycle at a time. The reason for this is the decoder only requires the previous hidden state of the RNN and syndrome measurements from the last dT (dT << T ) time steps as input. This ensures that the decoder processing time per cycle is constant with respect to cycle count. This helps decrease the latency of the model, defined as the time required to decode the syndrome once the final measurement has been fed to the decoder [21], [22], since the model is able to process syndrome measurements during computation. In order to achieve zero latency, the processing time per cycle must be as fast as the clock speed of the quantum computer, which for superconducting qubits is roughly 1µs [21], [22], [57], [58]. Our decoder is around one order of magnitude slower than this, achieving a per cycle processing time of 8.76 ± 0.94µs for the distance 3 model, 11.33 ± 0.66µs for the distance 5 model, and 17.87 ± 0.65µs for the distance 7 model, with error rate p = 0.1%, running on an Nvidia A40 GPU. However, this could be improved by optimizing the model for inference, and running on specialized hardware such as a field-programmable gate array (FPGA). 29 5 Conclusions This work introduced a data-driven approach to decoding the surface code using a combination of graph neural networks and recurrent neural networks. The decoder was trained on hundreds of millions of simulations of the surface code under circuit level noise. It is able to outperform the classical MWPM algorithm on code distances 3 and 5 for thousands of surface code cycles using an error rate p = 0.001. However, it falls behind on code distance 7. Further, our decoder outperforms MWPM for all error rates p ∈ {0.001, 0.002, 0.003, 0.004, 0.005} on code distances 3 and 5 for a fixed syndrome length T = 99. When it comes to decoding time, MWPM is faster than our decoder when decoding a whole syndrome at once. However, due to the recurrent nature of our decoder, it is able to process syndrome measurements one cycle at a time during computation, as opposed to accumulating measurements and decoding them all at once. Further, the processing time per cycle is constant with respect to cycle count. These factors help reduce the latency of the decoder, which is desirable. There are a couple of problems with the approach presented in this work. The pri- mary problem relates to the amount of time required for data generation. Since we chose to always generate new data during training, as opposed to having a fixed data set, data must continuously be generated on the CPU and transferred to the GPU on a batch-by-batch basis during training and inference. This takes a long time, around 30%-50% of the total training time. This problem is exacerbated by the fact that the way in which the syndrome graphs are created in an overlapping fashion also takes a long time. The reason this takes a long time is that all nodes belonging to a batch are put in the same two-dimensional tensor of shape [N, 5] where N is the total number of nodes in a batch and 5 is the number of node features. Ideally, a three-dimensional tensor of shape [B,Ni, 5], where B is the batch size and Ni the number of nodes in syndrome i, would be used, since the node duplication could be vectorized with this approach. However, because the number of nodes varies from syndrome to syndrome, i.e. Ni is not constant across a batch, this does not work. Further, the graph convolution layers [59] of PyTorch Geometric expect a two-dimensional input. As such, much of the node duplication is done inside of a Python for-loop, as opposed to being done in a vectorized fashion, adding a further performance penalty. The neural network architecture described in this paper only decodes the hidden state of the RNN corresponding to the last time step. This is because, by default, Stim only returns a label indicating which class the syndromes belong to for the last time step. However, with a bit of extra work, it is possible to extract the syndrome 30 5. Conclusions class for each time step. This could potentially improve both the performance and training time of the decoder, since all hidden states could be used during training, instead of neglecting all but the last one. Finally, ethical considerations of this work relate to the fact that quantum error cor- recting decoders contribute to enabling reliable quantum computing. While quan- tum computing is not inherently unethical, it can be used for nefarious purposes, e.g. breaking encryption schemes such as RSA and Diffie-Hellman using Shor’s algorithm [4], [5], [60]. Further, the decoder introduced in this work is based on neural networks, specifically leveraging machine learning to learn a mapping from syndrome measurement data to likely error chains. This introduces additional ethi- cal dimensions beyond those tied to quantum computing alone. Neural networks are often characterized as ”black box” models. While they may demonstrate high per- formance, their internal decision-making processes are typically opaque and difficult to interpret. This lack of transparency raises concerns around trust, verification, and accountability, especially in safety-critical or high-stakes domains. 31 5. Conclusions 32 References [1] R. Santagati, A. Aspuru-Guzik, R. Babbush, et al., “Drug design on quantum computers,” Nature Physics, vol. 20, no. 4, pp. 549–557, Mar. 2024, issn: 1745-2481. doi: 10.1038/s41567-024-02411-5. [Online]. Available: http: //dx.doi.org/10.1038/s41567-024-02411-5. [2] H. Shang, L. Shen, Y. Fan, et al., “Large-scale simulation of quantum com- putational chemistry on a new sunway supercomputer,” Dec. 2022. doi: 10. 1109/SC41404.2022.00019. [3] N. Stamatopoulos, D. J. Egger, Y. Sun, et al., “Option pricing using quantum computers,” Quantum, vol. 4, p. 291, Jul. 2020, issn: 2521-327X. doi: 10. 22331/q-2020-07-06-291. [Online]. Available: http://dx.doi.org/10. 22331/q-2020-07-06-291. [4] P. Shor, “Algorithms for quantum computation: Discrete logarithms and fac- toring,” in Proceedings 35th Annual Symposium on Foundations of Computer Science, 1994, pp. 124–134. doi: 10.1109/SFCS.1994.365700. [5] P. W. Shor, “Polynomial-time algorithms for prime factorization and dis- crete logarithms on a quantum computer,” SIAM Journal on Computing, vol. 26, no. 5, pp. 1484–1509, Oct. 1997, issn: 1095-7111. doi: 10 . 1137 / s0097539795293172. [Online]. Available: http://dx.doi.org/10.1137/ S0097539795293172. [6] W. K. Wootters and W. H. Zurek, “A single quantum cannot be cloned,” Nature, vol. 299, no. 5886, pp. 802–803, Oct. 1982. doi: 10.1038/299802a0. [Online]. Available: http://dx.doi.org/10.1038/299802a0. [7] A. Kitaev, “Fault-tolerant quantum computation by anyons,” Annals of Physics, vol. 303, no. 1, pp. 2–30, Jan. 2003, issn: 0003-4916. doi: 10.1016/s0003- 4916(02)00018-0. [Online]. Available: http://dx.doi.org/10.1016/S0003- 4916(02)00018-0. [8] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, “Topological quantum mem- ory,” Journal of Mathematical Physics, vol. 43, no. 9, pp. 4452–4505, Sep. 2002, issn: 1089-7658. doi: 10.1063/1.1499754. [Online]. Available: http: //dx.doi.org/10.1063/1.1499754. [9] J. Edmonds, “Paths, trees, and flowers,” Canadian Journal of Mathematics, vol. 17, pp. 449–467, 1965. doi: 10.4153/CJM-1965-045-4. [10] D. S. Wang, A. G. Fowler, A. M. Stephens, and L. C. L. Hollenberg, Threshold error rates for the toric and surface codes, 2009. arXiv: 0905.0531 [quant-ph]. [Online]. Available: https://arxiv.org/abs/0905.0531. [11] D. S. Wang, A. G. Fowler, and L. C. L. Hollenberg, “Surface code quantum computing with error rates over 1%,” Phys. Rev. A, vol. 83, p. 020 302, 2 33 https://doi.org/10.1038/s41567-024-02411-5 http://dx.doi.org/10.1038/s41567-024-02411-5 http://dx.doi.org/10.1038/s41567-024-02411-5 https://doi.org/10.1109/SC41404.2022.00019 https://doi.org/10.1109/SC41404.2022.00019 https://doi.org/10.22331/q-2020-07-06-291 https://doi.org/10.22331/q-2020-07-06-291 http://dx.doi.org/10.22331/q-2020-07-06-291 http://dx.doi.org/10.22331/q-2020-07-06-291 https://doi.org/10.1109/SFCS.1994.365700 https://doi.org/10.1137/s0097539795293172 https://doi.org/10.1137/s0097539795293172 http://dx.doi.org/10.1137/S0097539795293172 http://dx.doi.org/10.1137/S0097539795293172 https://doi.org/10.1038/299802a0 http://dx.doi.org/10.1038/299802a0 https://doi.org/10.1016/s0003-4916(02)00018-0 https://doi.org/10.1016/s0003-4916(02)00018-0 http://dx.doi.org/10.1016/S0003-4916(02)00018-0 http://dx.doi.org/10.1016/S0003-4916(02)00018-0 https://doi.org/10.1063/1.1499754 http://dx.doi.org/10.1063/1.1499754 http://dx.doi.org/10.1063/1.1499754 https://doi.org/10.4153/CJM-1965-045-4 https://arxiv.org/abs/0905.0531 https://arxiv.org/abs/0905.0531 References Feb. 2011. doi: 10.1103/PhysRevA.83.020302. [Online]. Available: https: //link.aps.org/doi/10.1103/PhysRevA.83.020302. [12] T. M. Stace, S. D. Barrett, and A. C. Doherty, “Thresholds for topological codes in the presence of loss,” Phys. Rev. Lett., vol. 102, p. 200 501, 20 May 2009. doi: 10.1103/PhysRevLett.102.200501. [Online]. Available: https: //link.aps.org/doi/10.1103/PhysRevLett.102.200501. [13] T. M. Stace and S. D. Barrett, “Error correction and degeneracy in surface codes suffering loss,” Phys. Rev. A, vol. 81, p. 022 317, 2 Feb. 2010. doi: 10.1103/PhysRevA.81.022317. [Online]. Available: https://link.aps.org/ doi/10.1103/PhysRevA.81.022317. [14] A. G. Fowler, Optimal complexity correction of correlated errors in the surface code, 2013. arXiv: 1310 . 0863 [quant-ph]. [Online]. Available: https : / / arxiv.org/abs/1310.0863. [15] A. G. Fowler, Minimum weight perfect matching of fault-tolerant topological quantum error correction in average O(1) parallel time, 2014. arXiv: 1307. 1740 [quant-ph]. [Online]. Available: https://arxiv.org/abs/1307.1740. [16] B. Criger and I. Ashraf, “Multi-path Summation for Decoding 2D Topological Codes,” Quantum, vol. 2, p. 102, Oct. 2018, issn: 2521-327X. doi: 10.22331/ q-2018-10-19-102. [Online]. Available: https://doi.org/10.22331/q- 2018-10-19-102. [17] J. R. Wootton and D. Loss, “High threshold error correction for the surface code,” Phys. Rev. Lett., vol. 109, p. 160 503, 16 Oct. 2012. doi: 10.1103/ PhysRevLett.109.160503. [Online]. Available: https://link.aps.org/doi/ 10.1103/PhysRevLett.109.160503. [18] S. Bravyi, M. Suchara, and A. Vargo, “Efficient algorithms for maximum like- lihood decoding in the surface code,” Phys. Rev. A, vol. 90, p. 032 326, 3 Sep. 2014. doi: 10.1103/PhysRevA.90.032326. [Online]. Available: https: //link.aps.org/doi/10.1103/PhysRevA.90.032326. [19] A. Hutter, J. R. Wootton, and D. Loss, “Efficient markov chain monte carlo algorithm for the surface code,” Phys. Rev. A, vol. 89, p. 022 326, 2 Feb. 2014. doi: 10.1103/PhysRevA.89.022326. [Online]. Available: https://link.aps. org/doi/10.1103/PhysRevA.89.022326. [20] K. Hammar, A. Orekhov, P. W. Hybelius, et al., “Error-rate-agnostic decoding of topological stabilizer codes,” Phys. Rev. A, vol. 105, p. 042 616, 4 Apr. 2022. doi: 10.1103/PhysRevA.105.042616. [Online]. Available: https://link. aps.org/doi/10.1103/PhysRevA.105.042616. [21] R. Acharya, L. Aghababaie-Beni, I. Aleiner, et al., Quantum error correction below the surface code threshold, 2024. arXiv: 2408.13687 [quant-ph]. [On- line]. Available: https://arxiv.org/abs/2408.13687. [22] J. Bausch, A. W. Senior, F. J. H. Heras, et al., “Learning to decode the surface code with a recurrent, transformer-based neural network,” Oct. 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.05900. [23] G. Torlai and R. G. Melko, “Neural decoder for topological codes,” Physical Review Letters, vol. 119, no. 3, Jul. 2017, issn: 1079-7114. doi: 10.1103/ physrevlett.119.030501. [Online]. Available: http://dx.doi.org/10. 1103/PhysRevLett.119.030501. 34 https://doi.org/10.1103/PhysRevA.83.020302 https://link.aps.org/doi/10.1103/PhysRevA.83.020302 https://link.aps.org/doi/10.1103/PhysRevA.83.020302 https://doi.org/10.1103/PhysRevLett.102.200501 https://link.aps.org/doi/10.1103/PhysRevLett.102.200501 https://link.aps.org/doi/10.1103/PhysRevLett.102.200501 https://doi.org/10.1103/PhysRevA.81.022317 https://link.aps.org/doi/10.1103/PhysRevA.81.022317 https://link.aps.org/doi/10.1103/PhysRevA.81.022317 https://arxiv.org/abs/1310.0863 https://arxiv.org/abs/1310.0863 https://arxiv.org/abs/1310.0863 https://arxiv.org/abs/1307.1740 https://arxiv.org/abs/1307.1740 https://arxiv.org/abs/1307.1740 https://doi.org/10.22331/q-2018-10-19-102 https://doi.org/10.22331/q-2018-10-19-102 https://doi.org/10.22331/q-2018-10-19-102 https://doi.org/10.22331/q-2018-10-19-102 https://doi.org/10.1103/PhysRevLett.109.160503 https://doi.org/10.1103/PhysRevLett.109.160503 https://link.aps.org/doi/10.1103/PhysRevLett.109.160503 https://link.aps.org/doi/10.1103/PhysRevLett.109.160503 https://doi.org/10.1103/PhysRevA.90.032326 https://link.aps.org/doi/10.1103/PhysRevA.90.032326 https://link.aps.org/doi/10.1103/PhysRevA.90.032326 https://doi.org/10.1103/PhysRevA.89.022326 https://link.aps.org/doi/10.1103/PhysRevA.89.022326 https://link.aps.org/doi/10.1103/PhysRevA.89.022326 https://doi.org/10.1103/PhysRevA.105.042616 https://link.aps.org/doi/10.1103/PhysRevA.105.042616 https://link.aps.org/doi/10.1103/PhysRevA.105.042616 https://arxiv.org/abs/2408.13687 https://arxiv.org/abs/2408.13687 https://doi.org/10.48550/arXiv.2310.05900 https://doi.org/10.1103/physrevlett.119.030501 https://doi.org/10.1103/physrevlett.119.030501 http://dx.doi.org/10.1103/PhysRevLett.119.030501 http://dx.doi.org/10.1103/PhysRevLett.119.030501 References [24] S. Krastanov and L. Jiang, “Deep neural network probabilistic decoder for stabilizer codes,” Scientific Reports, vol. 7, no. 1, Sep. 2017, issn: 2045-2322. doi: 10.1038/s41598-017-11266-1. [Online]. Available: http://dx.doi. org/10.1038/s41598-017-11266-1. [25] S. Varsamopoulos, B. Criger, and K. Bertels, “Decoding small surface codes with feedforward neural networks,” Quantum Science and Technology, vol. 3, no. 1, p. 015 004, Nov. 2017, issn: 2058-9565. doi: 10.1088/2058- 9565/ aa955a. [Online]. Available: http://dx.doi.org/10.1088/2058- 9565/ aa955a. [26] P. Baireuther, T. E. O’Brien, B. Tarasinski, and C. W. J. Beenakker, “Machine- learning-assisted correction of correlated qubit errors in a topological code,” Quantum, vol. 2, p. 48, Jan. 2018, issn: 2521-327X. doi: 10.22331/q-2018- 01-29-48. [Online]. Available: http://dx.doi.org/10.22331/q-2018-01- 29-48. [27] N. P. Breuckmann and X. Ni, “Scalable neural network decoders for higher dimensional quantum codes,” Quantum, vol. 2, p. 68, May 2018, issn: 2521- 327X. doi: 10.22331/q-2018-05-24-68. [Online]. Available: http://dx. doi.org/10.22331/q-2018-05-24-68. [28] S. Varsamopoulos, K. Bertels, and C. G. Almudever, “Comparing neural net- work based decoders for the surface code,” IEEE Transactions on Computers, vol. 69, no. 2, pp. 300–311, Feb. 2020, issn: 2326-3814. doi: 10.1109/tc. 2019.2948612. [Online]. Available: http://dx.doi.org/10.1109/TC.2019. 2948612. [29] P. Baireuther, M. D. Caio, B. Criger, C. W. J. Beenakker, and T. E. O’Brien, “Neural network decoder for topological color codes with circuit level noise,” New Journal of Physics, vol. 21, no. 1, p. 013 003, Jan. 2019, issn: 1367-2630. doi: 10.1088/1367-2630/aaf29e. [Online]. Available: http://dx.doi.org/ 10.1088/1367-2630/aaf29e. [30] B. M. Varbanov, M. Serra-Peralta, D. Byfield, and B. M. Terhal, “Neural net- work decoder for near-term surface-code experiments,” Phys. Rev. Res., vol. 7, p. 013 029, 1 Jan. 2025. doi: 10.1103/PhysRevResearch.7.013029. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevResearch.7. 013029. [31] H. Jung, I. Ali, and J. Ha, “Convolutional neural decoder for surface codes,” IEEE Transactions on Quantum Engineering, vol. 5, pp. 1–13, 2024. doi: 10.1109/TQE.2024.3419773. [32] V. K, D. S, and S. T, “Rl-qec: Harnessing reinforcement learning for quantum error correction advancements,” in 2024 International Conference on Trends in Quantum Computing and Emerging Business Technologies, 2024, pp. 1–5. doi: 10.1109/TQCEBT59414.2024.10545200. [33] R. Sweke, M. S. Kesselring, E. P. L. van Nieuwenburg, and J. Eisert, “Re- inforcement learning decoders for fault-tolerant quantum computation,” Ma- chine Learning: Science and Technology, vol. 2, no. 2, p. 025 005, Dec. 2020. doi: 10.1088/2632-2153/abc609. [Online]. Available: https://dx.doi. org/10.1088/2632-2153/abc609. 35 https://doi.org/10.1038/s41598-017-11266-1 http://dx.doi.org/10.1038/s41598-017-11266-1 http://dx.doi.org/10.1038/s41598-017-11266-1 https://doi.org/10.1088/2058-9565/aa955a https://doi.org/10.1088/2058-9565/aa955a http://dx.doi.org/10.1088/2058-9565/aa955a http://dx.doi.org/10.1088/2058-9565/aa955a https://doi.org/10.22331/q-2018-01-29-48 https://doi.org/10.22331/q-2018-01-29-48 http://dx.doi.org/10.22331/q-2018-01-29-48 http://dx.doi.org/10.22331/q-2018-01-29-48 https://doi.org/10.22331/q-2018-05-24-68 http://dx.doi.org/10.22331/q-2018-05-24-68 http://dx.doi.org/10.22331/q-2018-05-24-68 https://doi.org/10.1109/tc.2019.2948612 https://doi.org/10.1109/tc.2019.2948612 http://dx.doi.org/10.1109/TC.2019.2948612 http://dx.doi.org/10.1109/TC.2019.2948612 https://doi.org/10.1088/1367-2630/aaf29e http://dx.doi.org/10.1088/1367-2630/aaf29e http://dx.doi.org/10.1088/1367-2630/aaf29e https://doi.org/10.1103/PhysRevResearch.7.013029 https://link.aps.org/doi/10.1103/PhysRevResearch.7.013029 https://link.aps.org/doi/10.1103/PhysRevResearch.7.013029 https://doi.org/10.1109/TQE.2024.3419773 https://doi.org/10.1109/TQCEBT59414.2024.10545200 https://doi.org/10.1088/2632-2153/abc609 https://dx.doi.org/10.1088/2632-2153/abc609 https://dx.doi.org/10.1088/2632-2153/abc609 References [34] P. Andreasson, J. Johansson, S. Liljestrand, and M. Granath, “Quantum error correction for the toric code using deep reinforcement learning,” Quantum, vol. 3, p. 183, Sep. 2019, issn: 2521-327X. doi: 10.22331/q-2019-09-02- 183. [Online]. Available: https://doi.org/10.22331/q-2019-09-02-183. [35] M. Lange, P. Havström, B. Srivastava, et al., Data-driven decoding of quantum error correcting codes using graph neural networks, 2023. arXiv: 2307.01241 [quant-ph]. [Online]. Available: https://arxiv.org/abs/2307.01241. [36] R. Feynman, F. Vernon, and R. Hellwarth, “Geometrical representation of the Schrödinger equation for solving maser problems,” J. Appl. Phys., vol. 28, pp. 49–52, 1957. [37] R. Cleve, A. Ekert, C. Macchiavello, and M. Mosca, “Quantum algorithms revisited,” Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 454, no. 1969, pp. 339–354, Jan. 1998, issn: 1471-2946. doi: 10.1098/rspa.1998.0164. [Online]. Available: http: //dx.doi.org/10.1098/rspa.1998.0164. [38] J. Roffe, “Quantum error correction: An introductory guide,” Contemporary Physics, vol. 60, no. 3, pp. 226–245, Jul. 2019, issn: 1366-5812. doi: 10.1080/ 00107514.2019.1667078. [Online]. Available: http://dx.doi.org/10.1080/ 00107514.2019.1667078. [39] D. Gottesman, Stabilizer codes and quantum error correction, 1997. arXiv: quant-ph/9705052 [quant-ph]. [Online]. Available: https://arxiv.org/ abs/quant-ph/9705052. [40] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface codes: Towards practical large-scale quantum computation,” Phys. Rev. A, vol. 86, p. 032 324, 3 Sep. 2012. doi: 10.1103/PhysRevA.86.032324. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.86.032324. [41] Y. Xiao, B. Srivastava, and M. Granath, “Exact results on finite size correc- tions for surface codes tailored to biased noise,” Quantum, vol. 8, p. 1468, Sep. 2024, issn: 2521-327X. doi: 10.22331/q-2024-09-11-1468. [Online]. Available: https://doi.org/10.22331/q-2024-09-11-1468. [42] A. deMarti iOlius, P. Fuentes, R. Orús, P. M. Crespo, and J. Etxezarreta Martinez, “Decoding algorithms for surface codes,” Quantum, vol. 8, p. 1498, Oct. 2024, issn: 2521-327X. doi: 10.22331/q-2024-10-10-1498. [Online]. Available: http://dx.doi.org/10.22331/q-2024-10-10-1498. [43] W. Mcculloch and W. Pitts, “A logical calculus of ideas immanent in nervous activity,” Bulletin of Mathematical Biophysics, vol. 5, pp. 127–147, 1943. [44] S. Linnainmaa, “Algoritmin kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirhei- den taylor-kehitelmänä (the representation of the cumulative rounding er- ror of an algorithm as a taylor expansion of the local rounding errors),” Thesis, 1970. [Online]. Available: https://people.idsia.ch/~juergen/ linnainmaa1970thesis.pdf. [45] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, pp. 179– 211, 1990. [46] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen netzen,” Apr. 1991. 36 https://doi.org/10.22331/q-2019-09-02-183 https://doi.org/10.22331/q-2019-09-02-183 https://doi.org/10.22331/q-2019-09-02-183 https://arxiv.org/abs/2307.01241 https://arxiv.org/abs/2307.01241 https://arxiv.org/abs/2307.01241 https://doi.org/10.1098/rspa.1998.0164 http://dx.doi.org/10.1098/rspa.1998.0164 http://dx.doi.org/10.1098/rspa.1998.0164 https://doi.org/10.1080/00107514.2019.1667078 https://doi.org/10.1080/00107514.2019.1667078 http://dx.doi.org/10.1080/00107514.2019.1667078 http://dx.doi.org/10.1080/00107514.2019.1667078 https://arxiv.org/abs/quant-ph/9705052 https://arxiv.org/abs/quant-ph/9705052 https://arxiv.org/abs/quant-ph/9705052 https://doi.org/10.1103/PhysRevA.86.032324 https://link.aps.org/doi/10.1103/PhysRevA.86.032324 https://doi.org/10.22331/q-2024-09-11-1468 https://doi.org/10.22331/q-2024-09-11-1468 https://doi.org/10.22331/q-2024-10-10-1498 http://dx.doi.org/10.22331/q-2024-10-10-1498 https://people.idsia.ch/~juergen/linnainmaa1970thesis.pdf https://people.idsia.ch/~juergen/linnainmaa1970thesis.pdf References [47] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994. doi: 10.1109/72.279181. [48] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Compu- tation, vol. 9, pp. 1735–1780, Nov. 1997. doi: 10.1162/neco.1997.9.8.1735. [49] K. Cho, B. van Merrienboer, C. Gulcehre, et al., Learning phrase represen- tations using rnn encoder-decoder for statistical machine translation, 2014. arXiv: 1406.1078 [cs.CL]. [Online]. Available: https://arxiv.org/abs/ 1406.1078. [50] C. Morris, M. Ritzert, M. Fey, et al., Weisfeiler and leman go neural: Higher- order graph neural networks, 2021. arXiv: 1810 . 02244 [cs.LG]. [Online]. Available: https://arxiv.org/abs/1810.02244. [51] C. Gidney, “Stim: A fast stabilizer circuit simulator,” Quantum, vol. 5, p. 497, Jul. 2021, issn: 2521-327X. doi: 10.22331/q-2021-07-06-497. [Online]. Available: https://doi.org/10.22331/q-2021-07-06-497. [52] J. Ansel, E. Yang, H. He, et al., “Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation,” in Pro- ceedings of the 29th ACM International Conference on Architectural Sup- port for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24), ACM, Apr. 2024. doi: 10.1145/3620665.3640366. [Online]. Available: https://pytorch.org/assets/pytorch2-2.pdf. [53] M. Fey and J. E. Lenssen, Fast graph representation learning with pytorch geometric, 2019. arXiv: 1903 . 02428 [cs.LG]. [Online]. Available: https : //arxiv.org/abs/1903.02428. [54] M. Altalhan, A. Algarni, and M. Turki-Hadj Alouane, “Imbalanced data prob- lem in machine learning: A review,” IEEE Access, vol. 13, pp. 13 686–13 699, 2025. doi: 10.1109/ACCESS.2025.3531662. [55] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2017. arXiv: 1412.6980 [cs.LG]. [Online]. Available: https://arxiv.org/abs/ 1412.6980. [56] O. Higgott, Pymatching: A python package for decoding quantum codes with minimum-weight perfect matching, 2021. arXiv: 2105.13082 [quant-ph]. [On- line]. Available: https://arxiv.org/abs/2105.13082. [57] F. Arute, K. Arya, R. Babbush, et al., “Quantum supremacy using a pro- grammable superconducting processor,” Nature, vol. 574, pp. 505–510, 2019. [Online]. Available: https : / / www . nature . com / articles / s41586 - 019 - 1666-5. [58] R. Acharya, I. Aleiner, R. Allen, et al., “Suppressing quantum errors by scaling a surface code logical qubit,” Nature, vol. 614, no. 7949, pp. 676–681, Feb. 2023, issn: 1476-4687. doi: 10.1038/s41586-022-05434-1. [Online]. Available: http://dx.doi.org/10.1038/s41586-022-05434-1. [59] Conv.graphconv, 2025. [Online]. Available: https://pytorch- geometric. readthedocs . io / en / latest / generated / torch _ geometric . nn . conv . GraphConv.html#torch_geometric.nn.conv.GraphConv. [60] M. Kaplan, G. Leurent, A. Leverrier, and M. Naya-Plasencia, “Breaking sym- metric cryptosystems using quantum period finding,” in Advances in Cryp- 37 https://doi.org/10.1109/72.279181 https://doi.org/10.1162/neco.1997.9.8.1735 https://arxiv.org/abs/1406.1078 https://arxiv.org/abs/1406.1078 https://arxiv.org/abs/1406.1078 https://arxiv.org/abs/1810.02244 https://arxiv.org/abs/1810.02244 https://doi.org/10.22331/q-2021-07-06-497 https://doi.org/10.22331/q-2021-07-06-497 https://doi.org/10.1145/3620665.3640366 https://pytorch.org/assets/pytorch2-2.pdf https://arxiv.org/abs/1903.02428 https://arxiv.org/abs/1903.02428 https://arxiv.org/abs/1903.02428 https://doi.org/10.1109/ACCESS.2025.3531662 https://arxiv.org/abs/1412.6980 https://arxiv.org/abs/1412.6980 https://arxiv.org/abs/1412.6980 https://arxiv.org/abs/2105.13082 https://arxiv.org/abs/2105.13082 https://www.nature.com/articles/s41586-019-1666-5 https://www.nature.com/articles/s41586-019-1666-5 https://doi.org/10.1038/s41586-022-05434-1 http://dx.doi.org/10.1038/s41586-022-05434-1 https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GraphConv.html#torch_geometric.nn.conv.GraphConv https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GraphConv.html#torch_geometric.nn.conv.GraphConv https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GraphConv.html#torch_geometric.nn.conv.GraphConv References tology – CRYPTO 2016, M. Robshaw and J. Katz, Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2016, pp. 207–237, isbn: 978-3-662-53008-5. 38 A Appendix A.1 Appendix 1 The code for this project can be found at github.com/Olfj/QEC_GNN-RNN/releases/tag/v1.0 A.2 Appendix 2 Network parameters for the models presented in this thesis. Layer Input Features Output Features Param # GraphConv 1 5 32 352 GraphConv 2 32 64 4,160 GraphConv 3 64 128 16,512 GraphConv 4 128 256 65,792 GRU 256, 128 128 445,440 Linear 128 1 129 Total 532,385 Table A.1: Layer-wise input/output features and parameter counts of the distance 3 and 5 GRUDecoder model. Layer Input Features Output Features Param # GraphConv 1 5 32 352 GraphConv 2 32 64 4,160 GraphConv 3 64 128 16,512 GraphConv 4 128 256 65,792 GraphConv 5 256 512 262,656 GRU 512, 256 256 1,775,616 Linear 256 1 257 Total 2,125,345 Table A.2: Layer-wise input/output features and parameter counts of the larger distance 7 GRUDecoder model. I https://github.com/Olfj/QEC_GNN-RNN/releases/tag/v1.0 DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden www.chalmers.se www.chalmers.se List of Acronyms List of Figures Introduction Theory Graph theory Quantum computing Qubit Pauli operators Quantum circuits Quantum gates Entanglement Quantum error codes Bit-flip repetition code The stabilizer formalism The surface code Minimum-weight perfect matching Neural networks Feed forward, fully connected neural networks Recurrent neural networks Gated recurrent unit Graph convolutional networks Global mean pool Methods Data generation Neural network architecture Training setup Results Decoder training Decoder performance Conclusions References Appendix Appendix 1 Appendix 2