XL

ZL

Sequential Graph-Based Decoding of the
Surface Code using a Hybrid Graph and
Recurrent Neural Network Model
Master’s thesis in Complex Adaptive Systems

OLE FJELDSÅ

GUSTAF JONASSON JOHANSSON

DEPARTMENT OF PHYSICS

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

www.chalmers.se


Master’s thesis 2025

Sequential Graph-Based Decoding of the Surface
Code using a Hybrid Graph and Recurrent

Neural Network Model

OLE FJELDSÅ, GUSTAF JONASSON JOHANSSON

Department of Physics
Chalmers University of Technology

Gothenburg, Sweden 2025


Sequential Graph-Based Decoding of the Surface Code using a Hybrid Graph and
Recurrent Neural Network Model

OLE FJELDSÅ
GUSTAF JONASSON JOHANSSON

© OLE FJELDSÅ, GUSTAF JONASSON JOHANSSON, 2025.

Supervisor: Mats Granath, Department of Physics, University of Gothenburg.
Examiner: Mats Granath, Department of Physics, University of Gothenburg.

Master’s Thesis 2025
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Example syndrome with accompanying graphs of size dT = 2 over three
surface code cycles.

Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

iv


Sequential Graph-Based Decoding of the Surface Code using a Hybrid Graph and
Recurrent Neural Network Model

OLE FJELDSÅ
GUSTAF JONASSON JOHANSSON
Department of Physics
Chalmers University of Technology

Abstract
In order to achieve reliable quantum computation with noisy qubits, quantum error
correction (QEC) is necessary. Quantum error correcting codes mitigate the inher-
ent noise in quantum systems by distributing the logical state over several qubits,
thereby introducing redundancy. One such promising code is the surface code. It
encodes the logical qubit using a two-dimensional lattice of physical data qubits
and ancilla qubits. By taking and decoding measurements on the ancilla qubits of
the surface code, one can deduce whether a logical bit- or phase-flip has occurred.
However, this is a complex and potentially time-consuming task. Multiple decoding
algorithms exist, such as the classical minimum-weight perfect matching (MWPM)
decoder. In recent years, data-driven algorithms have been shown to decode the
surface code with a high degree of accuracy. In this thesis, we present a machine
learning approach to decoding the surface code using a combination of graph neural
networks (GNN) and recurrent neural networks (RNN). Specifically, graph represen-
tations are constructed over a short, sliding time window of syndrome measurement
data. Each representation is processed by a GNN and its output is used as a learned
high-dimensional embedding for an RNN. This enables continuous decoding of mea-
surement patterns over longer time series. While the decoder is trained on relatively
short syndromes, it is able to generalize for unseen syndromes and longer time se-
ries, outperforming the classical MWPM algorithm across both short and long time
series. This work opens up a new approach to reliable and potentially fast decoding
of QEC codes.

Keywords: Quantum error correction, graph neural networks, recurrent neural net-
works, gated recurrent unit, surface code.

v


Acknowledgements
We are grateful to Mats Granath and Moritz Lange for their invaluable guidance,
insightful discussions, and patience throughout this project. We also thank Blaž
Pridgar for his technical help and contributions to our model analysis. Finally, we
appreciate Tobias Wallström and Eduardo Manoni for their spirited banter.

Computations were enabled by resources provided by the National Academic Infras-
tructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish
Research Council through grant agreement no. 2022-06725.

Ole Fjeldså and Gustaf Jonasson Johansson, Gothenburg, June 2025

vii


List of Acronyms

Below is the list of acronyms that have been used throughout this thesis listed in
alphabetical order:

CPU Central Processing Unit
FPGA Field Programmable Gate Array
GMP Global Mean Pool
GNN Graph Neural Network
GPU Graphics Processing Unit
GRU Gated Recurrent Unit
LSTM Long Short-Term Memory
MWPM Minimum Weight Perfect Matching
RNN Recurrent Neural Network

ix


Contents

List of Acronyms ix

List of Figures xiii

1 Introduction 1

2 Theory 3
2.1 Graph theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Quantum computing . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Qubit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.2 Pauli operators . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Quantum circuits . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.4 Quantum gates . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.5 Entanglement . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3 Quantum error codes . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Bit-flip repetition code . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 The stabilizer formalism . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 The surface code . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.4 Minimum-weight perfect matching . . . . . . . . . . . . . . . 13

2.4 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Feed forward, fully connected neural networks . . . . . . . . . 14
2.4.2 Recurrent neural networks . . . . . . . . . . . . . . . . . . . . 15
2.4.3 Gated recurrent unit . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.4 Graph convolutional networks . . . . . . . . . . . . . . . . . . 17
2.4.5 Global mean pool . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Methods 19
3.1 Data generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Neural network architecture . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Training setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4 Results 26
4.1 Decoder training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.2 Decoder performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 Conclusions 30

xi


Contents

References 33

A Appendix I
A.1 Appendix 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I
A.2 Appendix 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I

xii


List of Figures

2.1 The Bloch sphere. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Entanglement of two qubits. . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 The two qubit encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 The surface code four-cycle. . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Surface code from four-cycles. . . . . . . . . . . . . . . . . . . . . . . 11
2.6 The simplified surface code schematic. . . . . . . . . . . . . . . . . . 12
2.7 Logical XL, YL and ZL operators on the surface code. . . . . . . . . . 12
2.8 MWPM example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.9 MWPM example on the surface code. . . . . . . . . . . . . . . . . . . 14
2.10 Deep neural network. . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.11 Fundamental structure for a recurrent neural network. . . . . . . . . 16
2.12 The gated recurrent unit. . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.13 Graph convolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Sliding window schema. . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Time required to generate syndrome graphs based on surface codes. . 21
3.3 Overview of the entire model pipeline. . . . . . . . . . . . . . . . . . 23
3.4 Proportion of class 0 syndromes as a function of syndrome length. . . 25

4.1 Accuracy and model output as a function of training epoch. . . . . . 27
4.2 Accuracy and failure rate as a function of code cycles. . . . . . . . . . 28
4.3 Accuracy and failure rate as a function of code cycles. . . . . . . . . . 28
4.4 Accuracy and failure rate as a function of error rate. . . . . . . . . . . 29

xiii


List of Figures

xiv


1
Introduction

Quantum computing has the potential to transform how we solve complex prob-
lems in fields such as drug design [1], chemistry [2], and financial market analysis
[3]. Perhaps its most well-known application is prime factorization [4], [5]. Unlike
classical computers, which process information using bits that exist in one of two dis-
tinct states (0 or 1), quantum computers use qubits that can exist in superpositions
of both states simultaneously. This unique property, combined with entanglement
and other quantum phenomena, enables quantum algorithms to outperform classical
methods in certain tasks. However, qubits are highly fragile, with their quantum
state easily disrupted by environmental noise or imperfect control signals. Even
the slightest interference can cause decoherence, collapsing the superposition and
erasing valuable information. To counteract this, quantum error correction (QEC)
is essential for preserving coherence and enabling reliable quantum computation.

To allow for error correction in a classical computer we can use a simple repetition
code. For example, a simple three bit encoding {0, 1} → {000, 111} can correct
a single bit-flip error and detect a two bit-flip error. Unfortunately, such an error
correction scheme is not possible for qubits due to the no-cloning theorem [6], the
presence of both bit-flips and phase-flips, and the wave function collapse resulting
from reading the qubit.

To perform error correction on qubits, stabilizer codes are used. A stabilizer code
is a type of quantum error correcting code that protects quantum information by
encoding so-called logical qubits as a larger set of physical qubits by entanglement.
It is defined by a set of stabilizer operators, which are elements of the Pauli group
that commute with each other and stabilize the encoded logical states. Errors are
detected by measuring these stabilizers without collapsing the quantum state of the
logical qubit. The surface code [7], [8] is one such stabilizer code. It is a topological
error correcting code that protects quantum information by encoding a logical qubit
in a two-dimensional lattice of physical qubits.

Efficiently decoding the surface code is a challenging problem. One widely used
approach is the Minimum Weight Perfect Matching (MWPM) decoder [8]–[16], a
classical method that performs well in general but struggles with correlated X/Z
error rates and imperfect measurements. Maximum likelihood decoders [8], [17]–
[20], on the other hand, achieve excellent accuracy but are computationally infea-
sible due to their exponential time complexity. A promising alternative is machine
learning-based decoders, a number of which have been proposed [21]–[34]. Machine

1


1. Introduction

learning models have the potential to decode rapidly once trained and are adaptable
to different noise models, offering both flexibility and efficiency.

This work builds on previous work from from the Department of Physics at Uni-
versity of Gothenburg [35] using graph neural networks (GNN) to classify the most
likely error class of a graph of stabilizer measurements. We build on that by break-
ing down graphs of longer syndromes into overlapping sub-graphs. These graphs are
then processed by a combination of a GNN and a recurrent neural network (RNN)
to generalize the decoding over variable time frames, in a similar fashion to [22].

We show that the combined GNN-RNN approach outperforms MWPM over long
time series for simulated stabilizers under surface level noise. The decoder is able
to generalize over several thousand measurement cycles while only being trained on
syndromes of 49-99 stabilizer cycles.

2


2
Theory

This chapter provides the necessary theoretical background for this thesis. It cov-
ers the basics of graph theory, quantum computing, quantum error correction and
machine learning with neural networks.

2.1 Graph theory
A directed weighted graph is described by the tuple G = (N,E,w) where N is
a set of nodes, E ⊆ {(ni, nj) | ni, nj ∈ N2, i ̸= j} is a set of directed edges
between nodes and w is a function E → T mapping edges to a weight of type
T . Similarly, an undirected weighted graph has a set of undirected edges E ⊆
{(ni, nj) | ni, nj ∈ N, i ̸= j} between nodes. An undirected weighted graph can be
obtained from a directed weighted graph by ensuring that each edge is bidirectional
and equally weighted in both directions i.e. if (ni, nj) ∈ E then (nj, ni) ∈ E and
w(ni, nj) = w(nj, ni). Graphs serve as versatile models for relational structures,
where nodes represent entities and edges denote relationships. The number of nodes
in a graph |N | is called the order of the graph and the number of edges in a graph
|E| is called the size of the graph. There are several different ways of deciding the
weights of an edge. If for example the nodes are coordinates (x, y), the weights
between them could be the distance between the coordinates. In this example the
coordinates (x, y) would be referred to as the node features of the graph.

2.2 Quantum computing
This section provides the necessary background in quantum computing to build the
concepts of quantum error correction.

2.2.1 Qubit
The qubit is the quantum computer equivalent to the bit in a classical computer.
It represents the basic unit of information in quantum computing. The qubit |ψ⟩ is
written as

|ψ⟩ = α |0⟩ + β |1⟩ =
(
α
β

)

where |0⟩ =
(

1
0

)
, |1⟩ =

(
0
1

)
are orthonormal basis vectors, and α, β are complex

3


2. Theory

values with |α|2 + |β|2 = 1. Here α, β give us the probabilities of measuring a qubit
as |0⟩ or |1⟩. For example, the probability of measuring a zero is equal to |α|2.

X

Y

Z

|

|+

|-

|i

|-i

|0

|1

Figure 2.1: The Bloch sphere. Any state corresponds to a point
on the sphere, the positive and negative z direction represent the
basis states |0⟩ and |1⟩. The position of the state is defined by (2.1).

Figure 2.1 shows a geometric interpretation of the qubit called the Bloch sphere [36]
with the qubit state represented by a point on the sphere as

|ψ⟩ = cos θ2 |0⟩ + eiφ sin θ2 |1⟩ (2.1)

In addition to the basis states we have the notable states |+⟩, |−⟩, |i⟩ and |−i⟩ on
the positive and negative x and y axis of the sphere. All of these states have an
equal probability of being measured as |0⟩ or |1⟩ but have different phases.

Multiple qubits are written using the tensor product ⊗

|ψ⟩1 ⊗ |ψ⟩2 = (α1 |0⟩ + β1 |1⟩) ⊗ (α2 |0⟩ + β2 |1⟩)
= α1α2 |00⟩ + α1β2 |01⟩ + β1α2 |10⟩ + β1β2 |11⟩ (2.2)

For the two qubit case this can be written as a state vector of length four

α |00⟩ + β |01⟩ + γ |10⟩ + δ |11⟩ =


α
β
γ
δ

 (2.3)

4


2. Theory

2.2.2 Pauli operators
The Pauli operators are

1 =
(

1 0
0 1

)
, X =

(
0 1
1 0

)
, Y =

(
0 −i
i 0

)
, Z =

(
1 0
0 −1

)
(2.4)

Each of the operators X, Y, and Z have the effect of flipping the qubit π radians
around the x, y, or z or axis on the Bloch Sphere. As we can see from figure 2.1,
the X operator corresponds to a bit-flip error and the Z operator corresponds to
a phase-flip error. In the context of quantum error correction, the Y operator can
be seen as a combination of the X and Z operator as it corresponds to Y = iXZ.
Applying the X or Z operator to a qubit |ψ⟩ yields

X |ψ⟩ = αX |0⟩ + βX |1⟩ = α |1⟩ + β |0⟩ (2.5a)
Z |ψ⟩ = αZ |0⟩ + βZ |1⟩ = α |0⟩ − β |1⟩ (2.5b)

The Pauli group on a single qubit is defined as the Pauli operators

G1 = {±1,±i1,±X,±iX,±Y,±iY,±Z,±iZ}, (2.6)

with the ±1,±i terms to ensure that G1 is closed under multiplication. Every mem-
ber of G1 has eigenvalues ±1 or ±i, is unitary, hermitian and self-inverse. Addi-
tionally, any two members of the Pauli group satisfy the commutation and anti-
commutation relations

[σj, σk] = σjσk − σkσj = 2iεjklσl (2.7a)
{σj, σk} = σjσk + σkσj = 2δjk1 (2.7b)

where εjkl is the Levi-Civita symbol and δjk is the Kronecker delta

εjkl =


1 if (j, k, l) ∈ {(x, y, z), (y, z, x), (z, x, y)} (even permutation)

−1 if (j, k, l) ∈ {(x, z, y), (y, x, z), (z, y, x)} (odd permutation)
0 otherwise

δjk =
{

1 if j = k
0 otherwise

The general Pauli group is made up of all operators formed from tensor products
of elements from G1. If we for example have a three qubit system we can apply the
operation

1 ⊗X ⊗ Y = X2Y3,

where the left hand side is the tensor product of three operations (applied to their
respective qubit), and the right hand side is the support notation of the same oper-
ation.

5


2. Theory

2.2.3 Quantum circuits
Qubits and operators are visualized using quantum circuit diagrams. Each qubit is
represented by a horizontal line going from left to right through time. The different
operations applied to the qubits are placed along these lines. Single qubit operations
like H, X, and Z are represented as a single letter in a box placed on the line as
in (2.8). Multiple qubit gates are represented using vertical connections between
qubits as in (2.9) or rectangles covering multiple lines as in figure 2.3. Single lines
represent qubits while double lines represent classical information, the measurement

of a qubit is represented by the meter symbol .

2.2.4 Quantum gates
Besides the Pauli operators, some additional quantum operators are needed, namely
the Hadamard gate H and the controlled X and Z gates CNOT and CZ. The
Hadamard gate is defined as

H = 1√
2

(
1 1
1 −1

)
|ψ⟩ H (2.8)

It has the effect H |0⟩ = |+⟩ , H |1⟩ = |−⟩, and vice versa. In this way the Hadamard
gate can be used to bring a qubit in and out of superposition. The CNOT and CZ
gates are defined as

CNOT =


1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0

 control
target

(2.9)

CZ =


1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 −1

 control
target

(2.10)

These gates work as conditional X and Z gates. If the control is equal to |1⟩, apply
X or Z to the target qubit. If the control and target are in superposition the CNOT
and CZ gates have the effects

CNOT (α |00⟩ + β |01⟩ + δ |10⟩ + γ |11⟩) = α |00⟩ + β |01⟩ + γ |10⟩ + δ |11⟩ (2.11)
CZ(α |00⟩ + β |01⟩ + δ |10⟩ + γ |11⟩) = α |00⟩ + β |01⟩ + δ |10⟩ − γ |11⟩ (2.12)

One important phenomenon enabled by controlled qubits is phase kickback, which
refers to controlled operations having effects on both the target and control qubit
[37]. This is essential to QEC and exemplified in section 2.3.1.

6


2. Theory

2.2.5 Entanglement
One of the distinguishing features of quantum systems is the ability to entangle. In
lieu of duplicating quantum information as a way of introducing redundancy, as that
is not possible due to the no-cloning theorem [38], entanglement is utilized. Using
entanglement, it is possible to encode information with multi-particle superpositions
[37]. Figure 2.2 shows how one can entangle two qubits. As we can see from the
figure, measuring one of the qubits at the end of the circuit collapses the state of
the other qubit into a corresponding state due to entanglement.

|ψ⟩

|0⟩


α
0
β
0



α
0
0
β



Figure 2.2: Entanglement of two qubits. The state vectors before
and after the CNOT gate denote the states α |00⟩ + β |10⟩ and
α |00⟩ + β |11⟩ respectively.

2.3 Quantum error codes
In order to facilitate reliable QEC, one needs to introduce redundancy in the way
a qubit is represented. In practice, this means distributing the logical state over
multiple qubits. If only one qubit is used, it is not possible to deduce if its value
is the result of some computation, or an error. Using multiple qubits increases the
reliability of the system by taking advantage of the fact that an error happening in
multiple qubits is less likely than an error happening in just one qubit [38]. This
section covers the basics of QEC.

7


2. Theory

2.3.1 Bit-flip repetition code

|ψ⟩1

E

Z1

|0⟩2 Z2

|0⟩A H H

|ψ⟩L E |ψ⟩L

Figure 2.3: The two qubit encoder. The two qubits |ψ⟩1 and |0⟩2
are entangled using the CNOT gate as seen in figure 2.2 creating
the logical qubit |ψ⟩L. An error from table 2.1 is applied to |ψ⟩L.
The ancilla qubit in the |+⟩A state is used to perform the projective
Z1Z2 measurement on |ψ⟩L.

To allow for redundancy in a quantum code we can not simply copy bits [6]. Instead,
we entangle the state of the qubits, expanding their Hilbert space. This spreads the
state of a qubit |ψ⟩ over the logical state |ψ⟩L. Here we show the two qubit encoder
as seen in figure 2.3 to exemplify this.

|ψ⟩ two qubit encoder−−−−−−−−−−→ |ψ⟩L = α |00⟩ + β |11⟩ = α |0⟩L + β |1⟩L .

This has the effect of changing the Hilbert space of the qubit. Before encoding, the
qubit is parametrized within |ψ⟩ ∈ H2 = span{|0⟩ , |1⟩}. After the encoding this
changes to the Hilbert space

|ψ⟩L ∈ H4 = span{|00⟩ , |01⟩ , |10⟩ , |11⟩}

that we can split up into two mutually orthogonal subspaces

C = span{|00⟩ , |11⟩}, F = span{|01⟩ , |10⟩}.

In the two qubit system we have the following bit-flip errors:

Error 1112 X112 11X2 X1X2
Syndrome 0 1 1 0

Table 2.1: Syndrome table for the two qubit code.

If we now apply the bit flip X1 to |ψ⟩L we get

X1 |ψ⟩L = α |10⟩ + β |01⟩ ∈ F ⊂ H4

In addition to the logical qubit, the two qubit encoder utilizes a third qubit called
the ancilla qubit. The ancilla qubit is initialized in the |0⟩A state before applying
the Hadamard gate to it, changing it to the state |+⟩A. At this point the ancilla is

8


2. Theory

prepared and ready to be used in the projective measurement Z1Z2 using two CZ
gates. The Z1Z2 operator applied to |ψ⟩L yields the eigenvalue (+1)

Z1Z2 |ψ⟩L = Z1Z2(α |00⟩ + β |11⟩) = (+1) |ψ⟩L (2.13)

The Z1Z2 operator is said to stabilize |ψ⟩L as it leaves it unchanged. If we instead
apply the Z1Z2 operator to X1 |ψ⟩L we end up with the eigenvalue (−1)

Z1Z2X1 |ψ⟩L = Z1Z2(αX1 |00⟩ + βX1 |11⟩) = (−1)X1 |ψ⟩L (2.14)

This would also be the case for the bit-flip error X2 |ψ⟩L. Since the ancilla qubit is
used as a control for the Z1Z2 operator we get the state

1√
2

((1112)(E |ψ⟩L) |0⟩A + Z1Z2(E |ψ⟩L) |1⟩A) (2.15)

After applying the second Hadamard gate to the ancilla qubit we end up in the state

1
2((1112 + Z1Z2)(E |ψ⟩L) |0⟩A + (1112 − Z1Z2)(E |ψ⟩L) |1⟩A) (2.16)

If E |ψ⟩L ∈ F we get the eigenvalue Z1Z2E |ψ⟩L = (−1)E |ψ⟩L and the first term
goes to zero. If instead E |ψ⟩L ∈ C then Z1Z2E |ψ⟩L = (+1)E |ψ⟩L and the second
term goes to zero. Thus, measuring the ancilla qubit in the state |1⟩A informs us
that E ∈ {X112,11X2} and measuring the ancilla in the state |0⟩A informs us that
E ∈ {1112, X1X2}.

2.3.2 The stabilizer formalism
The stabilizer formalism describes the formal requirements of a general [[n, k, d]]
stabilizer code [39]. Here n is the total number of data qubits, k is the number of
logical qubits and d is the stabilizer code distance. The code distance of a quantum
error correction code is the minimum number of physical qubits that can go unde-
tected and cause a logical error. In the case of the two qubit encoder we can detect
but not correct a single bit-flip meaning that if the qubit was only susceptible to
bit-flip errors it would be of distance d = 2. As qubits are susceptible to phase-flip
errors as well, we need a stabilizer code to take the logical Z operator into account
when determining the code distance.

The stabilizers S of an [[n, k, d]] code is a subgroup of the n-qubit Pauli group Gn.
Gn is made up of all operators of length n formed from tensor products of elements
from G1(2.6). As with G1, the members of Gn either commute or anti-commute. All
stabilizers must stabilize all logical states of the code space C. They must also com-
mute with each other. The requirement for commutation is intersecting on an even
number (including zero) of data qubits. This allows the stabilizers to be measured
simultaneously [38].

Generally, any code that adheres to the stabilizer formalism can be used to, at most,
detect an error of length d− 1, and correct an error of length d−1

2 .

9


2. Theory

2.3.3 The surface code
A code that can detect both bit-flips and phase-flips is the so-called surface code.
Originally defined on a torus [7], [8], it can instead be created using a two-dimensional
lattice of qubits with two different types of boundaries [40]. The surface code four-
cycle is the building block of the surface code. Figure 2.4 shows the quantum circuit
diagram of the four-cycle and a simplified representation of the quantum circuit.

D1

D2A2

A1

|D1⟩

|D2⟩

|0⟩A1 H H

|0⟩A2 H H

Figure 2.4: Surface code four-cycle. The left hand side diagram
shows a simplified representation of the quantum circuit diagram
on the right. Pink represents X stabilizers while blue represents
Z stabilizers. The data qubits are represented by circles and the
ancilla qubits by squares.

While X and Z anti-commute (XZ = −ZX) the stabilizers XD1XD2 and ZD1ZD2

commute as they act on two shared qubits.

XD1XD2ZD1ZD2 = XD1ZD1XD2ZD2

= −ZD1XD1XD2ZD2

= (−1)2ZD1XD1ZD2XD2

= ZD1ZD2XD1XD2

(2.17)

Using the four-cycle we can build the surface code, the left hand side of figure 2.5
shows the [[13, 1, 3]] planar surface code. As we can see from the diagram, each
stabilizer commutes as they still intersect non-trivially on an even number of qubits.
From here we can reduce the cost of a logical qubit by rotating the planar surface
code as seen on the right hand side of figure 2.5. This reduces the amount of physical
qubits needed to encode the logical qubit while keeping the code distance d = 3. As
we can see from the diagram, the required commutation relations still hold.

10


2. Theory

D1 A1 D2 A2 D3

A3 D4 A4 D5 A5

D6 A6 D7 A7 D8

A8 D9 A9 D10 A10

D11 A11 D12 A12 D13

D1

A1

D2

A2

D3

A3

D4

A4

D5

A5

D6

A6

D7

A7

D8

A8

D9

A9

D10

A10

D11

A11

D12

A12

D13

Figure 2.5: Left: a tiling of the surface code four-cycle 2.4 creat-
ing the planar [[13, 1, 3]] surface code. Right: the [[9, 1, 3]] rotated
surface code made by rotating the four-cycle tiling.

After constructing the rotated surface code from the surface code four cycle, the
schematic is further simplified to the representation shown in figure 2.6, this is the
representation that will be used in the rest of this thesis. Additionally, the rotated
surface code will simply be referred to as the surface code.

11


2. Theory

ZL

XL

Y

X

Z

Figure 2.6: The simplified schematic for the [[25, 1, 5]] surface
code. Data qubits are represented by the open circles, while sta-
bilizers are represented by the squares and ”lobes” of the code.
Dark gray colouration represents X stabilizers while light gray rep-
resents Z stabilizers. Pink represents activated X stabilizers while
blue represents activated Z stabilizers. Errors are represented by
the letter on their respective data qubits. The logical X and Z
operators are along the left and top border.

On the surface code, the logical X and Z operators must still satisfy the anti-
commutation relationship XLZL = −ZLXL. Additionally, XL and ZL must com-
mute with all the stabilizers (preserve the eigenvalue of the ancilla qubits). This is
achieved by letting XL and ZL span the surface code along its left and upper border.
Furthermore, YL is defined as a combination of XL and ZL by replacing the top left
corner data qubit with the Y operator and placing X and Z operators along the rest
of their respective borders. Figure 2.7 shows an overview of the logical operators on
the surface code.

x

x

x

x

x z z z z z

x

x

x

x

Y z z z z

Figure 2.7: The logical X (left), Z (middle), and Y (right) opera-
tors on the surface code.

The logical error classes are the sets of errors having the same effect on the state of
the logical qubit as a logical operator. As such, error chains with an odd parity of
X operators along the upper border belong to the X coset, and error chains with

12


2. Theory

an odd parity of Z operators along the left border belong to the Z coset. As with
the logical X/Z operators, error chains with an odd parity along both the left and
upper border belong to the Y coset [41]. Error chains with an even parity along
the logical operators instead belong to the I coset. Figure 2.6 shows a syndrome
belonging to the logical Y coset.

As the stabilizers of a quantum error correction code can only show the parity of
the data qubits surrounding them, and not the error chain itself, the purpose of a
quantum error correction decoder is to decide the most likely chain of errors and
based on the error chain deduce to which logical coset the syndrome belongs.

2.3.4 Minimum-weight perfect matching

A matching of a graph G is a set of edges such that no edge shares a node with
another edge. A perfect matching includes every node in G. In order for the perfect
matching to be minimum-weight, it must minimize the sum of the edge weights.
Figure 2.8 shows an example of the minimum-weight perfect matching (MWPM) of
a graph.

(0, 0)

(4, 3)(1, 3)

(3, 0)

√
10

3

3

√
10

(0, 0)

(4, 3)(1, 3)

(3, 0)
3

3

MWPM

Figure 2.8: Creating the MWPM of a graph. Here the nodes are
2D Cartesian coordinates and the edge weights are the distances
between them.

The MWPM of a graph can be found using the blossom algorithm [9] developed by
Jack Edmonds in the 1960s, and it turns out this algorithm can be used to decode
quantum stabilizer codes. In QEC literature it is known as the MWPM decoder
[42]. Figure 2.9 shows an example of a matching using MWPM on the surface code.
Note the bright coloured ”virtual stabilizers” on the boundary of the surface code.
If an odd number of X or Z stabilizers are activated, the virtual boundary stabilizers
facilitate the creation of a matching using MWPM, since an even number of nodes
is required for this. Additionally, the virtual stabilizers allow for matchings to the
edge of the surface code even when an even number of stabilizers are activated. As
seen in figure 2.9, all virtual stabilizers do not need to be part of the matching, they
are only used as required. After the matching of the surface code is created, it is
used to deduce the most likely error chains.

13


2. Theory

ZL

XL

z z

z

z z

z

z z

z

Figure 2.9: Left: surface code with a syndrome from the logical
coset Z. Middle: the same syndrome with the fully connected graph
between all X stabilizers and all X virtual boundary checks. Right:
a matching using MWPM of the syndrome. The light blue and pink
plaquettes outside the surface code represent the additional virtual
boundary checks.

2.4 Neural networks
This project is based on developing a QEC decoder using a data-driven approach.
In particular, our decoder is based on neural networks. As such, this section covers
the basics of neural networks as it relates to this work.

2.4.1 Feed forward, fully connected neural networks

x0

x1

...

xnx
...

. . .

...

y′
0

y′
1

...

y′
ny

Figure 2.10: Deep neural network with input X =
{x0, x1, . . . , xnx}, a sequence of hidden layers, and an output Y′ =
{y′

0, y
′
1, . . . , y

′
ny}. Each layer Xl is updated according to (2.20) with

the input of each layer being the output of the previous.

14


2. Theory

The building block of neural networks is the (artificial) neuron, first proposed as a
mathematical model for the nervous system [43]. The value of a neuron N can be
written as

N(X) = f

b+
∑

xj∈X
xjwj

 (2.18)

where X is the input vector to the neuron, wj is the weight from input xj to the
neuron, b is the bias for the neuron, and f is a (typically non-linear) activation
function. From this we can create a layer of neurons with output Xl taking the
previous layer Xl−1 as input

Xl(Xl−1) =


N0
N1
...
Nn

 , where Ni(Xl−1) = f

bi +
∑

xj∈Xl−1

xjwji

 (2.19)

This can be rewritten as the matrix operation

Xl(Xl−1) = f (WlXl−1 + Bl) (2.20)

where Wl is a n × k matrix, Xl−1 is a k × d matrix, Bl is a n × d matrix, n is the
number of neurons in the layer and d is the number of samples in the input tensor
(batch dimension). These layers can be stacked after each other creating a deep
neural network as seen in figure 2.10.

The neural network is initialized with random weights and biases. To allow the
neural network to make precise predictions it is trained on known input output
pairs {X,Y} and every weight and bias is adjusted using backpropagation gradi-
ent descent [44]. In this process every parameter p (weights and biases) is updated
according to p′ = −η ∂E(Y,Y′)

∂p
. Here, E(Y,Y′) is a chosen error, or loss, function in-

dicating the performance of the network and η is a learning rate ensuring appropriate
size of the parameter update.

2.4.2 Recurrent neural networks

To allow a neural network to handle time series data of variable length it is useful
to give it some sort of ”memory”. The recurrent neural network (RNN) [45] allows
for this by connecting the hidden state (sequence of hidden layers) h of the network
at time t to the hidden state at time t+ 1 as seen in figure 2.11.

15


2. Theory

yt

h

xt

Wxh

Why

Whh

Unroll

yt−1 yt yt+1

ht−1 ht ht+1

xt−1 xt xt+1

. . . . . .

Wxh Wxh Wxh

Why Why Why

Whh Whh Whh Whh

Figure 2.11: Fundamental structure for a recurrent neural net-
work. The hidden layer(s) feed their state to the output and to
the output for the current time step. The figure shows both the
unrolled and rolled up schematic at time t.

The hidden state is similar to the feed forward network and the parameters are
updated using backpropagation.

2.4.3 Gated recurrent unit

ht−1

ht

xt

ht

rt = σ zt = σ nt = tanh

⊙

⊙

⊙

+

1−

Figure 2.12: Circuit diagram of a single GRU block. rt, zt, nt and
ht are calculated according to (2.21).

Using fully connected layers for the hidden state of the RNN can run into the
vanishing gradient problem for longer time series [46], [47], which makes network
training difficult. One approach to mitigating this is to use long short-term memory
(LSTM) [48] blocks for the hidden state. LSTM circumvents the vanishing gradient
by keeping separate long term and short term hidden states. The gated recurrent
unit (GRU) [49] builds on the LSTM but removes the cell state vector, combining

16


2. Theory

long and short term memory, giving it the parameters

rt = σ(Wxrxt + bxr +Whrh(t−1) + bhr) (2.21a)
zt = σ(Wxzxt + bxz +Whzh(t−1) + bhz) (2.21b)
nt = tanh(Wxnxt + bxn + rt ⊙ (Whnh(t−1) + bhn)) (2.21c)
ht = (1 − zt) ⊙ nt + zt ⊙ h(t−1) (2.21d)

σ(x) = 1
1 + exp(−x)

Here the reset gate rt decides how much of the previous state to remember. If rt = 0
the new state forgets the entirety of the old state. The update gate zt decides how
much of the past hidden state to keep. If zt = 1 we take none of the new state and
all of the old state and vice versa. The candidate state nt gives a candidate for a
new hidden state before the update gate decides how much of it to accept. Figure
2.12 gives a more illustrative view of the GRU block.

2.4.4 Graph convolutional networks

x1

x5

x2

x3

x4

e2,1

e5,1

x′
1 = f(b1 + W1x1 + W2(e5,1x5 + e2,1x2))

Figure 2.13: The update of a single node feature x′
1, according to

(2.22) during graph convolution.

Graph convolutional networks are a type of neural networks that are designed to
operate on graphs. Graph convolution layers map every node feature x1, x2, . . . to a
new state according to the graph neural network operator [50]

x′
i = f

bi + W1xi + W2
∑

j∈Ni

ejixj

 (2.22)

Like other neural networks, the weights and biases (Wi, bi) are trainable parameters
initialized randomly and updated using backpropagation. eji are the edge weights of
the graph and f is the activation function. The graph neural network is permutation
equivariant, meaning that it does not change the structure of the graph - only the
node features. Figure 2.13 shows the update rule acting on a single node feature
during a graph convolution.

17


2. Theory

2.4.5 Global mean pool
The global mean pool (GMP) of a graph G is the mean of every node feature across
the graph nodes NG according to

GMP(G) = 1
|NG|

∑
x∈NG

x (2.23)

where |NG| is the number of nodes in the graph. This reduces (pools) the state of
the graph into a single vector representation of the same size as a single node feature
vector. The fixed size of the pool makes it a suitable embedding between a graph
and neural network structures demanding fixed size inputs.

18


3
Methods

This chapter covers the methods that were employed when creating our quantum
error correction decoder. First, we explain how data for model training and inference
was generated. Next, we describe the neural network architecture. Finally, we cover
how the neural network was trained.

3.1 Data generation
In order to generate data for use during training and inference, the Python package
Stim [51] was used. Stim can be used to simulate quantum stabilizer circuits, such
as the surface code in our case. When simulating stabilizer circuits, bit-flip and
phase-flip errors are applied with error rate p using a noise model called circuit-level
noise. This noise model consists of depolarizing noise that applies X, Y , and Z
errors with equal probability pX = pY = pZ = p

3 , in addition to bit-flip errors after
ancilla qubit measurements and resets, and two-qubit gate errors.

After simulating the circuit for T cycles, or time steps, Stim returns a vector of
so-called detection events from T + 1 time steps, and a boolean value indicating if
the simulation has resulted in a logical bit-flip or phase-flip. This value is used as
the label when training and evaluating the neural network. Stim simulates a real
experiment and as such, the data qubits at the end of the simulation can only be
measured in either the X basis or the Z basis. If the data qubits are measured in the
X basis, logical phase-flips can be detected, and if they are measured in the Z basis,
logical bit-flips can be detected. It is this final measurement of the data qubits that
causes the +1 in the T +1 expression. The detection events are also boolean values,
indicating whether each stabilizer measurement differs from its previous state, i.e.
activated if there is an odd parity on the surrounding data qubits, not activated
otherwise.

Having generated a syndrome, a series of graphs each spanning dT time steps are
created. The nodes correspond to detection events, each represented by a feature
vector [x, y, tr, Z,X] where x, y, tr are node coordinates and X,Z are boolean val-
ues indicating stabilizer type. The x and y coordinates indicate the location of the
stabilizer on the surface code, x, y ∈ [0, d], the top left being the origin. The tr
coordinate indicates the relative time step of the detection event within the graph,
tr ∈ [0, dT − 1].

19


3. Methods

Each node has an edge to its twenty nearest neighbours. These edges are created
using a k-nearest neighbour algorithm. Generally, the graphs contain less than
twenty nodes, and as such are undirected and fully connected. In principle, though,
the graphs can contain more than twenty nodes, and in these cases the graphs are
not fully connected, and potentially not undirected. The number of nodes in a graph
has a positive correlation with the code distance d, the number of time steps the
graph spans dT , and the error rate p. Next, the edge weights are created. These
are calculated as the square of the inverse supremum norm between two nodes. For
instance, the edge weight between nodes i and j, eij, is:

eij = (max(|xi − xj|, |yi − yj|, |tri
− trj

|))−2 (3.1)

The graphs are created according to a sliding window schema, such that all nodes
(except for nodes in the first and last time step) are present in multiple overlapping
graphs, see Figure 3.1. Should a graph be empty, such as if there are no detection
events until t = 5 in Figure 3.1, leading to an empty Graph 1, the next graph, i.e.
Graph 2, takes its place as the first graph of the syndrome. Therefore, syndromes
contain a varying number of graphs for a given syndrome length T .

t = 0 t = 1 t = 2 t = 3 t = 4 t = 5 t = 6 t = 7 t = 8 ...

Graph 1 Graph 2 Graph 3 Graph 4 Graph 5

Figure 3.1: Diagram showing how graphs are created according
to a sliding window schema. The circles represent all nodes that
are present in the corresponding time step t.

Since a given node can belong to multiple graphs, something has to be done to avoid
creating a single graph that spans the entire syndrome. The solution we settled on
was making copies of each node. Each node is copied min(t, dT − 1) times, where
t corresponds to the time step the node belongs to. Nodes belonging to time step
t = 0 are only present in one graph, and therefore do not need to be copied. Nodes
belonging to time step t = 1 are present in two graphs, and as such need to be
copied once. Next, the time coordinate tr of each node copy is adjusted accordingly.
For instance, nodes in time step t = 4 belonging to Graph 1 in Figure 3.1 have time
coordinate tr = 4, and the ”same” nodes in Graph 2 instead have time coordinate
tr = 3. Each node is also assigned an index indicating to which graph it belongs.
This index is used when computing the edges.

Unfortunately, creating the syndrome graphs by first copying the nodes takes a sig-
nificant amount of time, as the number of nodes is increased by roughly a factor of
dT . During training of our neural network, around 30%-50% of the time is dedicated
to data generation. Section 3.3 covers this in greater detail. Further, during infer-
ence, the vast majority of the time is dedicated to data generation, as the model

20


3. Methods

is evaluated on syndromes of much longer length than those on which the model
is trained, and this takes a long time. Figure 3.2 shows how the time required to
generate syndrome graphs varies with syndrome length T and error rate p for code
distances d = 3, 5, 7.

0 200 400 600 800 1000
Number of surface code cycles

0

1

2

3

4

5

6

7

Ti
m

e 
(m

s)

d = 3
d = 5
d = 7

0.001 0.002 0.003 0.004 0.005
Error rate

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

Ti
m

e 
(m

s)

d = 3
d = 5
d = 7

Figure 3.2: Time required to generate syndrome graphs based
on surface codes of distances d = 3, 5, 7, with dT = 2. The left
panel shows how the time to generate syndrome graphs changes as
a function of syndrome length, with fixed error rate p = 0.001. The
right panel instead shows how the time to generate syndrome graphs
changes with error rate p, and fixed syndrome length T = 99. The
error bars represent one standard deviation. The results are based
on 5 · 103 randomly generated syndromes per data point.

3.2 Neural network architecture
The neural network is trained to detect either logical bit-flips or phase-flips. First,
the syndrome graphs are fed through a GNN whose purpose is to create a high
dimensional feature vector of each node. The GNN consists of a series of graph
convolutions that successively increase the size of the feature vectors. The recti-
fied linear unit (ReLU) activation function, see (3.2), is used following each graph
convolution.

ReLU(x) = max(0, x) (3.2)

Next, graph embeddings are obtained by averaging the node features of each graph
using an operation called global mean pool, see section 2.4.5. The graph embeddings
are then fed sequentially through an RNN, which in our case is a multi-layer GRU.
At each time step, the GRU outputs several hidden states, one for each layer. These,
in combination with the next graph embedding, are used as input to the GRU in
the following time step.

Finally, the hidden state corresponding to the last layer of the final time step is
decoded by feeding it through a dense layer, which maps the hidden state vector to

21


3. Methods

a scalar. A sigmoid function is then applied to limit the scalar to the interval (0, 1).
In order to determine if a logical bit-flip or phase-flip has occurred, this number is
rounded to the nearest integer. A zero indicates that no error has occurred, whereas
a one indicates that an error has indeed occurred. It is important to note that a
given network can only detect one type or error. If, in the last measurement round,
the data qubits are measured in the X basis, the network learns to detect logical
phase-flips, and if the data qubits are measured in the Z basis, the network learns
to detect logical bit-flips. Therefore, one needs two separate networks to deduce if
the syndrome belongs to the logical X , Y , Z, or I coset. We exclusively trained
networks to detect logical bit-flips, but it is trivial to train networks to detect logi-
cal phase-flips, as the only difference is which data qubits are measured in the last
round. Figure 3.3 shows an overview of the model pipeline.

22


3. Methods

ZL

XL

...

...

Graph

Graph

Graph

Graph

Graph Embedding

Graph Embedding

Graph Embedding

Graph Embedding

GRU

GRU

GRU

GRU

Dense

out

h4
T +1



x

y

tr
Z

X





x
y

tr
Z

X



Graph Convolutions

G
M
P

GRU layer 1:

GRU layer 2:

GRU layer 3:

GRU layer 4:

xt−1 xt xt+1

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

GRU block

h4
t−1 h4

t h4
t+1

. . .

. . .

. . .

. . .

. . .

. . .

. . .

. . .

C

A

B

Figure 3.3: Overview of the entire model pipeline. A: Transfer
surface codes to graph over all detector events. B: Embedding from
graph to embedding tensor using graph convolutions. C: four layer
Gated Recurrent Unit.

A benefit of using an RNN is that, during inference, the whole syndrome does not
need to be fed through the neural network at once, only the part that corresponds to
the last graph and the previous hidden state. In other words, decoding a syndrome
at time step t = 7999 theoretically takes the same amount of time as decoding a
syndrome at t = 99. This is not the case for other decoders, such as MWPM.

Our decoder was implemented using PyTorch [52] and PyTorch Geometric (PyG)
[53]. PyTorch is a machine learning library that enables efficient tensor computing.
PyG is built upon PyTorch and provides functionality to implement graph neural
networks.

23


3. Methods

3.3 Training setup
Since our decoder is based on a neural network, a significant amount of data is
needed to train it. As mentioned in section 3.1, data was obtained by generating it
using Stim. Instead of generating fixed training and test sets, data was generated
one batch at a time during training. As such, every batch that is fed through the
network is unique. This approach has both benefits and drawbacks. First, it de-
creases the tendency for the network to overfit on the provided data. On the other
hand, a significant amount of time during training is dedicated to data generation.
Further, because Stim runs on the CPU, every new batch of graphs that is generated
must be transferred to the GPU, which slows down training further. Generating the
data and transferring it to the GPU takes roughly 30%-50% of the total training
time.

Despite not using a fixed training set, the training was split into so-called epochs.
Generally, one epoch corresponds to feeding the whole data set through the neural
network once. We settled on 256 batches counting as one epoch, where each batch
consists of 2048 syndromes. Therefore, one epoch worth of data corresponds to
roughly 5 · 105 syndromes. The number of epochs required for the loss to converge
is positively correlated with code distance. Chapter 4 covers this in greater detail.
It is important to note that we did not make use of a test set as a way of evaluating
the performance of the model and the convergence of the loss during training. The
reason for this is that with no fixed training set, the model does not overfit on the
provided data, as each batch is randomly sampled from a state space much larger
than the total number of syndromes generated during training.

During training, the network learns to predict a label that is either 0 or 1. The
probability of a logical bit-flip or phase-flip error occurring is positively correlated
with syndrome length and error rate p. As such, the relative class proportion for
short syndromes is skewed towards class 0, i.e. the dataset is imbalanced. This can
cause issues when training machine learning models [54]. Figure 3.4 shows the class
0 proportion as a function of syndrome length for code distances 3, 5, and 7. We can
see that for shorter syndrome lengths, the dataset is indeed imbalanced. However,
the class 0 proportion converges to 0.5 as the syndrome length increases. The models
we trained were all based on syndromes with roughly equal class proportions.

24


3. Methods

20 40 60 80 100
Number of surface code cycles

0.45

0.50

0.55

0.60

0.65

0.70

0.75

0.80

0.85

Cl
as

s 0
 p

ro
po

rti
on

d = 3
d = 5
d = 7

Figure 3.4: Proportion of class 0 syndromes as a function of
syndrome length for code distances d = 3, 5, 7, with error rate
p = 0.003. The error bars represent one standard deviation. The
results are based on roughly 2.5·105 randomly generated syndromes
per data point.

It is important to note that syndromes containing no detection events are always
excluded, both during training and inference. Since nodes in the syndrome graphs
are based on detection events, the absence of them implies that all the graphs be-
longing to a syndrome are empty. Syndromes without any detection events belong
to class 0, and are referred to as trivial cases. These cases are also excluded from
model performance metrics, such as accuracy.

Our decoder was trained using the Adam (Adaptive Moment Estimation) optimizer
[55], with betas β1 = 0.9, β2 = 0.999. Adam is similar to ordinary stochastic gradient
descent, but it leverages momentum and adaptive learning rates for each parameter.
Momentum means that part of the previous update is added onto the next. This
helps speed up convergence. The decoder was optimized by minimizing the binary
cross-entropy (BCE) loss function:

BCE = − 1
N

N∑
i=1

[yilog(Oi) + (1 − yi)log(1 −Oi)] (3.3)

where N is the number of observations, yi ∈ {0, 1} is the target label, and Oi ∈ (0, 1)
is the output from the network. The initial learning rate, η0, was set to 10−3.
During training, the learning rate was decreased exponentially to 10−4 according to
ηt = max(10−4, η00.95t), where t corresponds to the current epoch.

25


4
Results

With a method for obtaining data and a neural network architecture in place, mod-
els can be trained and evaluated. In this chapter we first cover the training of the
models before presenting their performance.

4.1 Decoder training

Three models were trained, one for each distance d ∈ {3, 5, 7}. All three models
were trained with a varying error rate p ∈ {0.001, 0.002, 0.003, 0.004, 0.005} that
was uniformly sampled on a batch-by-batch basis. Because the number of epochs
required for the loss to converge is positively correlated with code distance, the dis-
tance 7 model was trained for 500 epochs, the distance 5 model for 300 epochs, and
the distance 3 model for 150 epochs. Since each epoch consists of roughly 5 · 105

syndromes, the distance 7 model was trained on 2.5 · 108 syndromes, the distance 5
model on 1.5 · 108 syndromes, and the distance 3 model on 7.5 · 107 syndromes.

The three models were trained with slightly different syndrome graph settings. Both
the distance 5 and 7 model were trained on syndromes of length T = 49, with sliding
window size dT = 2. The distance 3 model was instead trained on syndromes of
length T = 99 with sliding window size dT = 5. There are two reasons for this. First,
the time required to train models increases with code distance, syndrome length,
and sliding window size. The distance 7 model took roughly four days and eighteen
hours to train on an Nvidia A40 GPU. Had it instead been trained on syndromes of
length T = 99 with dT = 5, it would have taken more than a week. Therefore, in the
interest of time, we settled on shorter syndromes. Secondly, we experimented with
multiple different syndrome lengths and sliding window sizes, especially on distance
3 models, as they can be trained relatively quickly. We found that models that were
trained on syndromes of length T = 49 had similar performance to those trained
using T = 99. A few distance 3 models were trained on syndromes of length T = 24,
too. While these models performed relatively well on shorter syndromes, they failed
to generalize for longer syndromes.

The state space grows exponentially with code distance. As such, the amount of
information that the neural network needs to encode also grows exponentially. Ini-
tially, the models had the same number of parameters regardless of code distance,
roughly 5.3 · 105, with a graph embedding vector of length 256, and the hidden

26


4. Results

state of the GRU a vector of length 128. This configuration worked well for code
distances 3 and 5, but it was not enough for code distance 7. Therefore, the sizes of
the graph embedding vector and hidden state vector for the code distance 7 model
were doubled to 512 and 256, respectively, increasing the number of parameters to
roughly 2.1 · 106. See appendix A.2 for a more detailed breakdown of the number of
parameters and architecture of each model.

The results from training are summarized in figure 4.1. The left panel shows the
accuracy as a function of training epoch. As expected, the distance 3 model con-
verges at a lower accuracy compared to the two other models. The right panel shows
the mean model output for class 0 and class 1 before rounding, also as a function
of training epoch. The model output can be interpreted as the confidence of the
models. Based on this interpretation, the distance 3 model is less confident with
respect to which input belongs to which class.

0 100 200 300 400 500
Epoch

0.6

0.7

0.8

0.9

Ac
cu

ra
cy

d = 3
d = 5
d = 7

0 100 200 300 400 500
Epoch

0.2

0.4

0.6

0.8

M
od

el
 o

ut
pu

t d = 3
d = 5
d = 7
Class 1
Class 0

Figure 4.1: Accuracy (left) and model output (right) as a function
of training epoch. Each epoch consists of roughly 5 · 105 randomly
generated syndromes.

4.2 Decoder performance
Having trained the models, it is time to evaluate their performance. The MWPM
decoder implemented by the PyMatching library [56] is used as a benchmark for all
performance metrics, and is represented by dashed lines in figures 4.2 through 4.4.
Figure 4.2 shows the accuracy and failure rate of the three models as a function
of surface code cycle. The distance 3 and 5 models both outperform MWPM over
the entire domain. However, the distance 7 model has similar or worse performance
compared to MWPM.

27


4. Results

0 200 400 600 800 1000
Number of surface code cycles

0.825

0.850

0.875

0.900

0.925

0.950

0.975

1.000
Ac

cu
ra

cy

d = 3
d = 5
d = 7
Our model
MWPM

0 200 400 600 800 1000
Number of surface code cycles

10 4

10 3

10 2

10 1

Fa
ilu

re
 ra

te

d = 3
d = 5
d = 7
Our model
MWPM

Figure 4.2: Accuracy (left) and failure rate (right) as a function
of surface code cycles, with error rate p = 0.001. The vertical bars
indicate standard deviation. All models were tested with 5 · 105

syndromes per data point.

Figure 4.3 shows similar data compared to figure 4.2, but here the x-axis is loga-
rithmic and the models are evaluated on syndromes up to 104 surface code cycles.
As seen in figure 4.3, the performance of the distance 3 model drastically decreases,
approaching an accuracy of 0.5. This is equivalent to randomly guessing when the
dataset is balanced and there are two classes, as in this case. However, this is not
due to poor performance of the decoders, it is an intrinsic limitation of the life-time
of the qubit. The distance 5 model continues to confidently outperform MWPM
whereas the distance 7 model is noticeably worse than MWPM. While all three
models are trained on syndromes shorter than a hundred surface code cycles, the
figures suggest that they are able to generalize for much longer time series. The
distance 5 model, especially, performs well relative to MWPM for long time series.

101 102 103 104

Number of surface code cycles

0.5

0.6

0.7

0.8

0.9

1.0

Ac
cu

ra
cy

d = 3
d = 5
d = 7
Our model
MWPM

101 102 103 104

Number of surface code cycles

10 5

10 4

10 3

10 2

10 1

Fa
ilu

re
 ra

te

d = 3
d = 5
d = 7
Our model
MWPM

Figure 4.3: Accuracy (left) and failure rate (right) as a function
of surface code cycles up to 104 cycles, with error rate p = 0.001.
The vertical bars indicate standard deviation. All models are tested
with 6 · 104 syndromes per data point.

Next, the performance of the models is evaluated with respect to error rate p. Figure
4.4 shows model performance for a fixed syndrome length T = 99 as a function of

28


4. Results

error rate. The distance 3 and 5 models both outperform MWPM for all error rates
p ∈ {0.001, 0.002, 0.003, 0.004, 0.005}. The distance 7 model, on the other hand,
has similar performance as compared to MWPM for p = 0.001, but its relative
performance decreases as the error rate increases.

0.001 0.002 0.003 0.004 0.005
Error rate

0.70

0.75

0.80

0.85

0.90

0.95

1.00

Ac
cu

ra
cy

d = 3
d = 5
d = 7
Our model
MWPM

0.001 0.002 0.003 0.004 0.005
Error rate

10 3

10 2

10 1

Fa
ilu

re
 ra

te

d = 3
d = 5
d = 7
Our model
MWPM

Figure 4.4: Accuracy (left) and failure rate (right) as a function
of error rate. The vertical bars indicate standard deviation. All
models are tested with syndrome length T = 99 with 5 · 105 syn-
dromes per data point.

Finally, the time required to decode syndromes is evaluated. While our model is
slower than MWPM on any fixed length syndrome decoding, the use of an RNN
allows the decoder to process the syndrome at every time step. This means that,
instead of accumulating syndrome measurements and decoding the whole syndrome
at once, the decoder is able to process information one cycle at a time. The reason for
this is the decoder only requires the previous hidden state of the RNN and syndrome
measurements from the last dT (dT << T ) time steps as input. This ensures that
the decoder processing time per cycle is constant with respect to cycle count. This
helps decrease the latency of the model, defined as the time required to decode the
syndrome once the final measurement has been fed to the decoder [21], [22], since the
model is able to process syndrome measurements during computation. In order to
achieve zero latency, the processing time per cycle must be as fast as the clock speed
of the quantum computer, which for superconducting qubits is roughly 1µs [21], [22],
[57], [58]. Our decoder is around one order of magnitude slower than this, achieving
a per cycle processing time of 8.76 ± 0.94µs for the distance 3 model, 11.33 ± 0.66µs
for the distance 5 model, and 17.87 ± 0.65µs for the distance 7 model, with error
rate p = 0.1%, running on an Nvidia A40 GPU. However, this could be improved
by optimizing the model for inference, and running on specialized hardware such as
a field-programmable gate array (FPGA).

29


5
Conclusions

This work introduced a data-driven approach to decoding the surface code using a
combination of graph neural networks and recurrent neural networks. The decoder
was trained on hundreds of millions of simulations of the surface code under circuit
level noise. It is able to outperform the classical MWPM algorithm on code distances
3 and 5 for thousands of surface code cycles using an error rate p = 0.001. However,
it falls behind on code distance 7. Further, our decoder outperforms MWPM for all
error rates p ∈ {0.001, 0.002, 0.003, 0.004, 0.005} on code distances 3 and 5 for a fixed
syndrome length T = 99. When it comes to decoding time, MWPM is faster than
our decoder when decoding a whole syndrome at once. However, due to the recurrent
nature of our decoder, it is able to process syndrome measurements one cycle at a
time during computation, as opposed to accumulating measurements and decoding
them all at once. Further, the processing time per cycle is constant with respect to
cycle count. These factors help reduce the latency of the decoder, which is desirable.

There are a couple of problems with the approach presented in this work. The pri-
mary problem relates to the amount of time required for data generation. Since we
chose to always generate new data during training, as opposed to having a fixed data
set, data must continuously be generated on the CPU and transferred to the GPU
on a batch-by-batch basis during training and inference. This takes a long time,
around 30%-50% of the total training time. This problem is exacerbated by the fact
that the way in which the syndrome graphs are created in an overlapping fashion
also takes a long time. The reason this takes a long time is that all nodes belonging
to a batch are put in the same two-dimensional tensor of shape [N, 5] where N is
the total number of nodes in a batch and 5 is the number of node features. Ideally,
a three-dimensional tensor of shape [B,Ni, 5], where B is the batch size and Ni the
number of nodes in syndrome i, would be used, since the node duplication could
be vectorized with this approach. However, because the number of nodes varies
from syndrome to syndrome, i.e. Ni is not constant across a batch, this does not
work. Further, the graph convolution layers [59] of PyTorch Geometric expect a
two-dimensional input. As such, much of the node duplication is done inside of a
Python for-loop, as opposed to being done in a vectorized fashion, adding a further
performance penalty.

The neural network architecture described in this paper only decodes the hidden
state of the RNN corresponding to the last time step. This is because, by default,
Stim only returns a label indicating which class the syndromes belong to for the last
time step. However, with a bit of extra work, it is possible to extract the syndrome

30


5. Conclusions

class for each time step. This could potentially improve both the performance and
training time of the decoder, since all hidden states could be used during training,
instead of neglecting all but the last one.

Finally, ethical considerations of this work relate to the fact that quantum error cor-
recting decoders contribute to enabling reliable quantum computing. While quan-
tum computing is not inherently unethical, it can be used for nefarious purposes,
e.g. breaking encryption schemes such as RSA and Diffie-Hellman using Shor’s
algorithm [4], [5], [60]. Further, the decoder introduced in this work is based on
neural networks, specifically leveraging machine learning to learn a mapping from
syndrome measurement data to likely error chains. This introduces additional ethi-
cal dimensions beyond those tied to quantum computing alone. Neural networks are
often characterized as ”black box” models. While they may demonstrate high per-
formance, their internal decision-making processes are typically opaque and difficult
to interpret. This lack of transparency raises concerns around trust, verification,
and accountability, especially in safety-critical or high-stakes domains.

31


5. Conclusions

32


References

[1] R. Santagati, A. Aspuru-Guzik, R. Babbush, et al., “Drug design on quantum
computers,” Nature Physics, vol. 20, no. 4, pp. 549–557, Mar. 2024, issn:
1745-2481. doi: 10.1038/s41567-024-02411-5. [Online]. Available: http:
//dx.doi.org/10.1038/s41567-024-02411-5.

[2] H. Shang, L. Shen, Y. Fan, et al., “Large-scale simulation of quantum com-
putational chemistry on a new sunway supercomputer,” Dec. 2022. doi: 10.
1109/SC41404.2022.00019.

[3] N. Stamatopoulos, D. J. Egger, Y. Sun, et al., “Option pricing using quantum
computers,” Quantum, vol. 4, p. 291, Jul. 2020, issn: 2521-327X. doi: 10.
22331/q-2020-07-06-291. [Online]. Available: http://dx.doi.org/10.
22331/q-2020-07-06-291.

[4] P. Shor, “Algorithms for quantum computation: Discrete logarithms and fac-
toring,” in Proceedings 35th Annual Symposium on Foundations of Computer
Science, 1994, pp. 124–134. doi: 10.1109/SFCS.1994.365700.

[5] P. W. Shor, “Polynomial-time algorithms for prime factorization and dis-
crete logarithms on a quantum computer,” SIAM Journal on Computing,
vol. 26, no. 5, pp. 1484–1509, Oct. 1997, issn: 1095-7111. doi: 10 . 1137 /
s0097539795293172. [Online]. Available: http://dx.doi.org/10.1137/
S0097539795293172.

[6] W. K. Wootters and W. H. Zurek, “A single quantum cannot be cloned,”
Nature, vol. 299, no. 5886, pp. 802–803, Oct. 1982. doi: 10.1038/299802a0.
[Online]. Available: http://dx.doi.org/10.1038/299802a0.

[7] A. Kitaev, “Fault-tolerant quantum computation by anyons,” Annals of Physics,
vol. 303, no. 1, pp. 2–30, Jan. 2003, issn: 0003-4916. doi: 10.1016/s0003-
4916(02)00018-0. [Online]. Available: http://dx.doi.org/10.1016/S0003-
4916(02)00018-0.

[8] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, “Topological quantum mem-
ory,” Journal of Mathematical Physics, vol. 43, no. 9, pp. 4452–4505, Sep.
2002, issn: 1089-7658. doi: 10.1063/1.1499754. [Online]. Available: http:
//dx.doi.org/10.1063/1.1499754.

[9] J. Edmonds, “Paths, trees, and flowers,” Canadian Journal of Mathematics,
vol. 17, pp. 449–467, 1965. doi: 10.4153/CJM-1965-045-4.

[10] D. S. Wang, A. G. Fowler, A. M. Stephens, and L. C. L. Hollenberg, Threshold
error rates for the toric and surface codes, 2009. arXiv: 0905.0531 [quant-ph].
[Online]. Available: https://arxiv.org/abs/0905.0531.

[11] D. S. Wang, A. G. Fowler, and L. C. L. Hollenberg, “Surface code quantum
computing with error rates over 1%,” Phys. Rev. A, vol. 83, p. 020 302, 2

33

https://doi.org/10.1038/s41567-024-02411-5
http://dx.doi.org/10.1038/s41567-024-02411-5
http://dx.doi.org/10.1038/s41567-024-02411-5
https://doi.org/10.1109/SC41404.2022.00019
https://doi.org/10.1109/SC41404.2022.00019
https://doi.org/10.22331/q-2020-07-06-291
https://doi.org/10.22331/q-2020-07-06-291
http://dx.doi.org/10.22331/q-2020-07-06-291
http://dx.doi.org/10.22331/q-2020-07-06-291
https://doi.org/10.1109/SFCS.1994.365700
https://doi.org/10.1137/s0097539795293172
https://doi.org/10.1137/s0097539795293172
http://dx.doi.org/10.1137/S0097539795293172
http://dx.doi.org/10.1137/S0097539795293172
https://doi.org/10.1038/299802a0
http://dx.doi.org/10.1038/299802a0
https://doi.org/10.1016/s0003-4916(02)00018-0
https://doi.org/10.1016/s0003-4916(02)00018-0
http://dx.doi.org/10.1016/S0003-4916(02)00018-0
http://dx.doi.org/10.1016/S0003-4916(02)00018-0
https://doi.org/10.1063/1.1499754
http://dx.doi.org/10.1063/1.1499754
http://dx.doi.org/10.1063/1.1499754
https://doi.org/10.4153/CJM-1965-045-4
https://arxiv.org/abs/0905.0531
https://arxiv.org/abs/0905.0531


References

Feb. 2011. doi: 10.1103/PhysRevA.83.020302. [Online]. Available: https:
//link.aps.org/doi/10.1103/PhysRevA.83.020302.

[12] T. M. Stace, S. D. Barrett, and A. C. Doherty, “Thresholds for topological
codes in the presence of loss,” Phys. Rev. Lett., vol. 102, p. 200 501, 20 May
2009. doi: 10.1103/PhysRevLett.102.200501. [Online]. Available: https:
//link.aps.org/doi/10.1103/PhysRevLett.102.200501.

[13] T. M. Stace and S. D. Barrett, “Error correction and degeneracy in surface
codes suffering loss,” Phys. Rev. A, vol. 81, p. 022 317, 2 Feb. 2010. doi:
10.1103/PhysRevA.81.022317. [Online]. Available: https://link.aps.org/
doi/10.1103/PhysRevA.81.022317.

[14] A. G. Fowler, Optimal complexity correction of correlated errors in the surface
code, 2013. arXiv: 1310 . 0863 [quant-ph]. [Online]. Available: https : / /
arxiv.org/abs/1310.0863.

[15] A. G. Fowler, Minimum weight perfect matching of fault-tolerant topological
quantum error correction in average O(1) parallel time, 2014. arXiv: 1307.
1740 [quant-ph]. [Online]. Available: https://arxiv.org/abs/1307.1740.

[16] B. Criger and I. Ashraf, “Multi-path Summation for Decoding 2D Topological
Codes,” Quantum, vol. 2, p. 102, Oct. 2018, issn: 2521-327X. doi: 10.22331/
q-2018-10-19-102. [Online]. Available: https://doi.org/10.22331/q-
2018-10-19-102.

[17] J. R. Wootton and D. Loss, “High threshold error correction for the surface
code,” Phys. Rev. Lett., vol. 109, p. 160 503, 16 Oct. 2012. doi: 10.1103/
PhysRevLett.109.160503. [Online]. Available: https://link.aps.org/doi/
10.1103/PhysRevLett.109.160503.

[18] S. Bravyi, M. Suchara, and A. Vargo, “Efficient algorithms for maximum like-
lihood decoding in the surface code,” Phys. Rev. A, vol. 90, p. 032 326, 3
Sep. 2014. doi: 10.1103/PhysRevA.90.032326. [Online]. Available: https:
//link.aps.org/doi/10.1103/PhysRevA.90.032326.

[19] A. Hutter, J. R. Wootton, and D. Loss, “Efficient markov chain monte carlo
algorithm for the surface code,” Phys. Rev. A, vol. 89, p. 022 326, 2 Feb. 2014.
doi: 10.1103/PhysRevA.89.022326. [Online]. Available: https://link.aps.
org/doi/10.1103/PhysRevA.89.022326.

[20] K. Hammar, A. Orekhov, P. W. Hybelius, et al., “Error-rate-agnostic decoding
of topological stabilizer codes,” Phys. Rev. A, vol. 105, p. 042 616, 4 Apr. 2022.
doi: 10.1103/PhysRevA.105.042616. [Online]. Available: https://link.
aps.org/doi/10.1103/PhysRevA.105.042616.

[21] R. Acharya, L. Aghababaie-Beni, I. Aleiner, et al., Quantum error correction
below the surface code threshold, 2024. arXiv: 2408.13687 [quant-ph]. [On-
line]. Available: https://arxiv.org/abs/2408.13687.

[22] J. Bausch, A. W. Senior, F. J. H. Heras, et al., “Learning to decode the surface
code with a recurrent, transformer-based neural network,” Oct. 2023. [Online].
Available: https://doi.org/10.48550/arXiv.2310.05900.

[23] G. Torlai and R. G. Melko, “Neural decoder for topological codes,” Physical
Review Letters, vol. 119, no. 3, Jul. 2017, issn: 1079-7114. doi: 10.1103/
physrevlett.119.030501. [Online]. Available: http://dx.doi.org/10.
1103/PhysRevLett.119.030501.

34

https://doi.org/10.1103/PhysRevA.83.020302
https://link.aps.org/doi/10.1103/PhysRevA.83.020302
https://link.aps.org/doi/10.1103/PhysRevA.83.020302
https://doi.org/10.1103/PhysRevLett.102.200501
https://link.aps.org/doi/10.1103/PhysRevLett.102.200501
https://link.aps.org/doi/10.1103/PhysRevLett.102.200501
https://doi.org/10.1103/PhysRevA.81.022317
https://link.aps.org/doi/10.1103/PhysRevA.81.022317
https://link.aps.org/doi/10.1103/PhysRevA.81.022317
https://arxiv.org/abs/1310.0863
https://arxiv.org/abs/1310.0863
https://arxiv.org/abs/1310.0863
https://arxiv.org/abs/1307.1740
https://arxiv.org/abs/1307.1740
https://arxiv.org/abs/1307.1740
https://doi.org/10.22331/q-2018-10-19-102
https://doi.org/10.22331/q-2018-10-19-102
https://doi.org/10.22331/q-2018-10-19-102
https://doi.org/10.22331/q-2018-10-19-102
https://doi.org/10.1103/PhysRevLett.109.160503
https://doi.org/10.1103/PhysRevLett.109.160503
https://link.aps.org/doi/10.1103/PhysRevLett.109.160503
https://link.aps.org/doi/10.1103/PhysRevLett.109.160503
https://doi.org/10.1103/PhysRevA.90.032326
https://link.aps.org/doi/10.1103/PhysRevA.90.032326
https://link.aps.org/doi/10.1103/PhysRevA.90.032326
https://doi.org/10.1103/PhysRevA.89.022326
https://link.aps.org/doi/10.1103/PhysRevA.89.022326
https://link.aps.org/doi/10.1103/PhysRevA.89.022326
https://doi.org/10.1103/PhysRevA.105.042616
https://link.aps.org/doi/10.1103/PhysRevA.105.042616
https://link.aps.org/doi/10.1103/PhysRevA.105.042616
https://arxiv.org/abs/2408.13687
https://arxiv.org/abs/2408.13687
https://doi.org/10.48550/arXiv.2310.05900
https://doi.org/10.1103/physrevlett.119.030501
https://doi.org/10.1103/physrevlett.119.030501
http://dx.doi.org/10.1103/PhysRevLett.119.030501
http://dx.doi.org/10.1103/PhysRevLett.119.030501


References

[24] S. Krastanov and L. Jiang, “Deep neural network probabilistic decoder for
stabilizer codes,” Scientific Reports, vol. 7, no. 1, Sep. 2017, issn: 2045-2322.
doi: 10.1038/s41598-017-11266-1. [Online]. Available: http://dx.doi.
org/10.1038/s41598-017-11266-1.

[25] S. Varsamopoulos, B. Criger, and K. Bertels, “Decoding small surface codes
with feedforward neural networks,” Quantum Science and Technology, vol. 3,
no. 1, p. 015 004, Nov. 2017, issn: 2058-9565. doi: 10.1088/2058- 9565/
aa955a. [Online]. Available: http://dx.doi.org/10.1088/2058- 9565/
aa955a.

[26] P. Baireuther, T. E. O’Brien, B. Tarasinski, and C. W. J. Beenakker, “Machine-
learning-assisted correction of correlated qubit errors in a topological code,”
Quantum, vol. 2, p. 48, Jan. 2018, issn: 2521-327X. doi: 10.22331/q-2018-
01-29-48. [Online]. Available: http://dx.doi.org/10.22331/q-2018-01-
29-48.

[27] N. P. Breuckmann and X. Ni, “Scalable neural network decoders for higher
dimensional quantum codes,” Quantum, vol. 2, p. 68, May 2018, issn: 2521-
327X. doi: 10.22331/q-2018-05-24-68. [Online]. Available: http://dx.
doi.org/10.22331/q-2018-05-24-68.

[28] S. Varsamopoulos, K. Bertels, and C. G. Almudever, “Comparing neural net-
work based decoders for the surface code,” IEEE Transactions on Computers,
vol. 69, no. 2, pp. 300–311, Feb. 2020, issn: 2326-3814. doi: 10.1109/tc.
2019.2948612. [Online]. Available: http://dx.doi.org/10.1109/TC.2019.
2948612.

[29] P. Baireuther, M. D. Caio, B. Criger, C. W. J. Beenakker, and T. E. O’Brien,
“Neural network decoder for topological color codes with circuit level noise,”
New Journal of Physics, vol. 21, no. 1, p. 013 003, Jan. 2019, issn: 1367-2630.
doi: 10.1088/1367-2630/aaf29e. [Online]. Available: http://dx.doi.org/
10.1088/1367-2630/aaf29e.

[30] B. M. Varbanov, M. Serra-Peralta, D. Byfield, and B. M. Terhal, “Neural net-
work decoder for near-term surface-code experiments,” Phys. Rev. Res., vol. 7,
p. 013 029, 1 Jan. 2025. doi: 10.1103/PhysRevResearch.7.013029. [Online].
Available: https://link.aps.org/doi/10.1103/PhysRevResearch.7.
013029.

[31] H. Jung, I. Ali, and J. Ha, “Convolutional neural decoder for surface codes,”
IEEE Transactions on Quantum Engineering, vol. 5, pp. 1–13, 2024. doi:
10.1109/TQE.2024.3419773.

[32] V. K, D. S, and S. T, “Rl-qec: Harnessing reinforcement learning for quantum
error correction advancements,” in 2024 International Conference on Trends
in Quantum Computing and Emerging Business Technologies, 2024, pp. 1–5.
doi: 10.1109/TQCEBT59414.2024.10545200.

[33] R. Sweke, M. S. Kesselring, E. P. L. van Nieuwenburg, and J. Eisert, “Re-
inforcement learning decoders for fault-tolerant quantum computation,” Ma-
chine Learning: Science and Technology, vol. 2, no. 2, p. 025 005, Dec. 2020.
doi: 10.1088/2632-2153/abc609. [Online]. Available: https://dx.doi.
org/10.1088/2632-2153/abc609.

35

https://doi.org/10.1038/s41598-017-11266-1
http://dx.doi.org/10.1038/s41598-017-11266-1
http://dx.doi.org/10.1038/s41598-017-11266-1
https://doi.org/10.1088/2058-9565/aa955a
https://doi.org/10.1088/2058-9565/aa955a
http://dx.doi.org/10.1088/2058-9565/aa955a
http://dx.doi.org/10.1088/2058-9565/aa955a
https://doi.org/10.22331/q-2018-01-29-48
https://doi.org/10.22331/q-2018-01-29-48
http://dx.doi.org/10.22331/q-2018-01-29-48
http://dx.doi.org/10.22331/q-2018-01-29-48
https://doi.org/10.22331/q-2018-05-24-68
http://dx.doi.org/10.22331/q-2018-05-24-68
http://dx.doi.org/10.22331/q-2018-05-24-68
https://doi.org/10.1109/tc.2019.2948612
https://doi.org/10.1109/tc.2019.2948612
http://dx.doi.org/10.1109/TC.2019.2948612
http://dx.doi.org/10.1109/TC.2019.2948612
https://doi.org/10.1088/1367-2630/aaf29e
http://dx.doi.org/10.1088/1367-2630/aaf29e
http://dx.doi.org/10.1088/1367-2630/aaf29e
https://doi.org/10.1103/PhysRevResearch.7.013029
https://link.aps.org/doi/10.1103/PhysRevResearch.7.013029
https://link.aps.org/doi/10.1103/PhysRevResearch.7.013029
https://doi.org/10.1109/TQE.2024.3419773
https://doi.org/10.1109/TQCEBT59414.2024.10545200
https://doi.org/10.1088/2632-2153/abc609
https://dx.doi.org/10.1088/2632-2153/abc609
https://dx.doi.org/10.1088/2632-2153/abc609


References

[34] P. Andreasson, J. Johansson, S. Liljestrand, and M. Granath, “Quantum error
correction for the toric code using deep reinforcement learning,” Quantum,
vol. 3, p. 183, Sep. 2019, issn: 2521-327X. doi: 10.22331/q-2019-09-02-
183. [Online]. Available: https://doi.org/10.22331/q-2019-09-02-183.

[35] M. Lange, P. Havström, B. Srivastava, et al., Data-driven decoding of quantum
error correcting codes using graph neural networks, 2023. arXiv: 2307.01241
[quant-ph]. [Online]. Available: https://arxiv.org/abs/2307.01241.

[36] R. Feynman, F. Vernon, and R. Hellwarth, “Geometrical representation of
the Schrödinger equation for solving maser problems,” J. Appl. Phys., vol. 28,
pp. 49–52, 1957.

[37] R. Cleve, A. Ekert, C. Macchiavello, and M. Mosca, “Quantum algorithms
revisited,” Proceedings of the Royal Society of London. Series A: Mathematical,
Physical and Engineering Sciences, vol. 454, no. 1969, pp. 339–354, Jan. 1998,
issn: 1471-2946. doi: 10.1098/rspa.1998.0164. [Online]. Available: http:
//dx.doi.org/10.1098/rspa.1998.0164.

[38] J. Roffe, “Quantum error correction: An introductory guide,” Contemporary
Physics, vol. 60, no. 3, pp. 226–245, Jul. 2019, issn: 1366-5812. doi: 10.1080/
00107514.2019.1667078. [Online]. Available: http://dx.doi.org/10.1080/
00107514.2019.1667078.

[39] D. Gottesman, Stabilizer codes and quantum error correction, 1997. arXiv:
quant-ph/9705052 [quant-ph]. [Online]. Available: https://arxiv.org/
abs/quant-ph/9705052.

[40] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface
codes: Towards practical large-scale quantum computation,” Phys. Rev. A,
vol. 86, p. 032 324, 3 Sep. 2012. doi: 10.1103/PhysRevA.86.032324. [Online].
Available: https://link.aps.org/doi/10.1103/PhysRevA.86.032324.

[41] Y. Xiao, B. Srivastava, and M. Granath, “Exact results on finite size correc-
tions for surface codes tailored to biased noise,” Quantum, vol. 8, p. 1468,
Sep. 2024, issn: 2521-327X. doi: 10.22331/q-2024-09-11-1468. [Online].
Available: https://doi.org/10.22331/q-2024-09-11-1468.

[42] A. deMarti iOlius, P. Fuentes, R. Orús, P. M. Crespo, and J. Etxezarreta
Martinez, “Decoding algorithms for surface codes,” Quantum, vol. 8, p. 1498,
Oct. 2024, issn: 2521-327X. doi: 10.22331/q-2024-10-10-1498. [Online].
Available: http://dx.doi.org/10.22331/q-2024-10-10-1498.

[43] W. Mcculloch and W. Pitts, “A logical calculus of ideas immanent in nervous
activity,” Bulletin of Mathematical Biophysics, vol. 5, pp. 127–147, 1943.

[44] S. Linnainmaa, “Algoritmin kumulatiivinen pyöristysvirhe yksittäisten pyöristysvirhei-
den taylor-kehitelmänä (the representation of the cumulative rounding er-
ror of an algorithm as a taylor expansion of the local rounding errors),”
Thesis, 1970. [Online]. Available: https://people.idsia.ch/~juergen/
linnainmaa1970thesis.pdf.

[45] J. L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, pp. 179–
211, 1990.

[46] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen netzen,” Apr.
1991.

36

https://doi.org/10.22331/q-2019-09-02-183
https://doi.org/10.22331/q-2019-09-02-183
https://doi.org/10.22331/q-2019-09-02-183
https://arxiv.org/abs/2307.01241
https://arxiv.org/abs/2307.01241
https://arxiv.org/abs/2307.01241
https://doi.org/10.1098/rspa.1998.0164
http://dx.doi.org/10.1098/rspa.1998.0164
http://dx.doi.org/10.1098/rspa.1998.0164
https://doi.org/10.1080/00107514.2019.1667078
https://doi.org/10.1080/00107514.2019.1667078
http://dx.doi.org/10.1080/00107514.2019.1667078
http://dx.doi.org/10.1080/00107514.2019.1667078
https://arxiv.org/abs/quant-ph/9705052
https://arxiv.org/abs/quant-ph/9705052
https://arxiv.org/abs/quant-ph/9705052
https://doi.org/10.1103/PhysRevA.86.032324
https://link.aps.org/doi/10.1103/PhysRevA.86.032324
https://doi.org/10.22331/q-2024-09-11-1468
https://doi.org/10.22331/q-2024-09-11-1468
https://doi.org/10.22331/q-2024-10-10-1498
http://dx.doi.org/10.22331/q-2024-10-10-1498
https://people.idsia.ch/~juergen/linnainmaa1970thesis.pdf
https://people.idsia.ch/~juergen/linnainmaa1970thesis.pdf


References

[47] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with
gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5,
no. 2, pp. 157–166, 1994. doi: 10.1109/72.279181.

[48] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Compu-
tation, vol. 9, pp. 1735–1780, Nov. 1997. doi: 10.1162/neco.1997.9.8.1735.

[49] K. Cho, B. van Merrienboer, C. Gulcehre, et al., Learning phrase represen-
tations using rnn encoder-decoder for statistical machine translation, 2014.
arXiv: 1406.1078 [cs.CL]. [Online]. Available: https://arxiv.org/abs/
1406.1078.

[50] C. Morris, M. Ritzert, M. Fey, et al., Weisfeiler and leman go neural: Higher-
order graph neural networks, 2021. arXiv: 1810 . 02244 [cs.LG]. [Online].
Available: https://arxiv.org/abs/1810.02244.

[51] C. Gidney, “Stim: A fast stabilizer circuit simulator,” Quantum, vol. 5, p. 497,
Jul. 2021, issn: 2521-327X. doi: 10.22331/q-2021-07-06-497. [Online].
Available: https://doi.org/10.22331/q-2021-07-06-497.

[52] J. Ansel, E. Yang, H. He, et al., “Pytorch 2: Faster machine learning through
dynamic python bytecode transformation and graph compilation,” in Pro-
ceedings of the 29th ACM International Conference on Architectural Sup-
port for Programming Languages and Operating Systems, Volume 2 (ASPLOS
’24), ACM, Apr. 2024. doi: 10.1145/3620665.3640366. [Online]. Available:
https://pytorch.org/assets/pytorch2-2.pdf.

[53] M. Fey and J. E. Lenssen, Fast graph representation learning with pytorch
geometric, 2019. arXiv: 1903 . 02428 [cs.LG]. [Online]. Available: https :
//arxiv.org/abs/1903.02428.

[54] M. Altalhan, A. Algarni, and M. Turki-Hadj Alouane, “Imbalanced data prob-
lem in machine learning: A review,” IEEE Access, vol. 13, pp. 13 686–13 699,
2025. doi: 10.1109/ACCESS.2025.3531662.

[55] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2017.
arXiv: 1412.6980 [cs.LG]. [Online]. Available: https://arxiv.org/abs/
1412.6980.

[56] O. Higgott, Pymatching: A python package for decoding quantum codes with
minimum-weight perfect matching, 2021. arXiv: 2105.13082 [quant-ph]. [On-
line]. Available: https://arxiv.org/abs/2105.13082.

[57] F. Arute, K. Arya, R. Babbush, et al., “Quantum supremacy using a pro-
grammable superconducting processor,” Nature, vol. 574, pp. 505–510, 2019.
[Online]. Available: https : / / www . nature . com / articles / s41586 - 019 -
1666-5.

[58] R. Acharya, I. Aleiner, R. Allen, et al., “Suppressing quantum errors by scaling
a surface code logical qubit,” Nature, vol. 614, no. 7949, pp. 676–681, Feb. 2023,
issn: 1476-4687. doi: 10.1038/s41586-022-05434-1. [Online]. Available:
http://dx.doi.org/10.1038/s41586-022-05434-1.

[59] Conv.graphconv, 2025. [Online]. Available: https://pytorch- geometric.
readthedocs . io / en / latest / generated / torch _ geometric . nn . conv .
GraphConv.html#torch_geometric.nn.conv.GraphConv.

[60] M. Kaplan, G. Leurent, A. Leverrier, and M. Naya-Plasencia, “Breaking sym-
metric cryptosystems using quantum period finding,” in Advances in Cryp-

37

https://doi.org/10.1109/72.279181
https://doi.org/10.1162/neco.1997.9.8.1735
https://arxiv.org/abs/1406.1078
https://arxiv.org/abs/1406.1078
https://arxiv.org/abs/1406.1078
https://arxiv.org/abs/1810.02244
https://arxiv.org/abs/1810.02244
https://doi.org/10.22331/q-2021-07-06-497
https://doi.org/10.22331/q-2021-07-06-497
https://doi.org/10.1145/3620665.3640366
https://pytorch.org/assets/pytorch2-2.pdf
https://arxiv.org/abs/1903.02428
https://arxiv.org/abs/1903.02428
https://arxiv.org/abs/1903.02428
https://doi.org/10.1109/ACCESS.2025.3531662
https://arxiv.org/abs/1412.6980
https://arxiv.org/abs/1412.6980
https://arxiv.org/abs/1412.6980
https://arxiv.org/abs/2105.13082
https://arxiv.org/abs/2105.13082
https://www.nature.com/articles/s41586-019-1666-5
https://www.nature.com/articles/s41586-019-1666-5
https://doi.org/10.1038/s41586-022-05434-1
http://dx.doi.org/10.1038/s41586-022-05434-1
https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GraphConv.html#torch_geometric.nn.conv.GraphConv
https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GraphConv.html#torch_geometric.nn.conv.GraphConv
https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.nn.conv.GraphConv.html#torch_geometric.nn.conv.GraphConv


References

tology – CRYPTO 2016, M. Robshaw and J. Katz, Eds., Berlin, Heidelberg:
Springer Berlin Heidelberg, 2016, pp. 207–237, isbn: 978-3-662-53008-5.

38


A
Appendix

A.1 Appendix 1
The code for this project can be found at github.com/Olfj/QEC_GNN-RNN/releases/tag/v1.0

A.2 Appendix 2
Network parameters for the models presented in this thesis.

Layer Input Features Output Features Param #
GraphConv 1 5 32 352
GraphConv 2 32 64 4,160
GraphConv 3 64 128 16,512
GraphConv 4 128 256 65,792
GRU 256, 128 128 445,440
Linear 128 1 129
Total 532,385

Table A.1: Layer-wise input/output features and parameter
counts of the distance 3 and 5 GRUDecoder model.

Layer Input Features Output Features Param #
GraphConv 1 5 32 352
GraphConv 2 32 64 4,160
GraphConv 3 64 128 16,512
GraphConv 4 128 256 65,792
GraphConv 5 256 512 262,656
GRU 512, 256 256 1,775,616
Linear 256 1 257
Total 2,125,345

Table A.2: Layer-wise input/output features and parameter
counts of the larger distance 7 GRUDecoder model.

I

https://github.com/Olfj/QEC_GNN-RNN/releases/tag/v1.0


DEPARTMENT OF PHYSICS
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden
www.chalmers.se

www.chalmers.se

	List of Acronyms
	List of Figures
	Introduction
	Theory
	Graph theory
	Quantum computing
	Qubit
	Pauli operators
	Quantum circuits
	Quantum gates
	Entanglement

	Quantum error codes
	Bit-flip repetition code
	The stabilizer formalism
	The surface code
	Minimum-weight perfect matching

	Neural networks
	Feed forward, fully connected neural networks
	Recurrent neural networks
	Gated recurrent unit
	Graph convolutional networks
	Global mean pool


	Methods
	Data generation
	Neural network architecture
	Training setup

	Results
	Decoder training
	Decoder performance

	Conclusions
	References
	Appendix
	Appendix 1
	Appendix 2