AI-guided detection of antibiotic-resistant bacteria using resistance genes
Understanding the importance of pre-training and model complexity when using transformer-based techniques to detect antibiotic resistance

Master's thesis in Biomedical Engineering

ERIK AERTS

Department of Mathematical Sciences
Applied Mathematics and Statistics
Erik Kristiansson group
Chalmers University of Technology
Gothenburg, Sweden 2024

© ERIK AERTS, 2024.

Supervisor: Erik Kristiansson & Anna Johnning, Mathematical Sciences
Examiner: Erik Kristiansson, Mathematical Sciences

Master's Thesis 2024
Department of Mathematical Sciences
Applied Mathematics and Statistics
Erik Kristiansson group
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2024

Abstract

Antibiotic resistance is threatening the advancements made in modern medicine. Understanding the genomics behind multi-resistant profiles can assist in planning the correct treatment, which can reduce antibiotic usage and hamper the vicious resistance cycle. Transformer-based AI models have shown state-of-the-art performance in understanding complex patterns in data. The thesis aimed to create a framework for implementing transformers that predict bacterial resistance profiles by training on genomic data. The framework consisted of a transformer-based encoder and parallel classification networks for predicting antibiotic susceptibility. Each model was trained on antibiotic resistance genes (ARGs) from Escherichia coli, where a subset of isolates had recorded resistance profiles.

The results showed that high encoder complexity is key for the model to accurately predict resistance to antibiotics for which resistance is rare. This is relevant for any clinical setting, as models with fewer than 12 encoder blocks could not find these resistance profiles. The framework benefited from pre-training on unlabeled genomic data, as performance generally increased. However, which type of masked language model pre-training benefited the system more was situational, and no general conclusion was drawn. Finally, the thesis also identified features in the data on which the models based their decisions. The number of ARGs in an isolate was deemed the most influential feature, which relates to how much information the transformer can process. In addition, relations between the ARGs gyrA-D87N / parC-S80I and aph(3'')-Ib / aph(6')-Id were shown to be an important decision basis for the models.
Likewise, two point mutations of the pmrB gene also stood out as important ARGs in the decision-making processes of the models. The reasons why these ARGs are weighted highly by the models are currently unknown, but they are of interest for further study to better understand the underlying factors of multi-resistance.

Keywords: artificial intelligence, antibiotic resistance, transformer, self-attention, embedding, encoder, pre-training, fine-tuning, masked language modelling.

Acknowledgements

I would like to thank the Department of Mathematical Sciences for warmly welcoming me, and a special thank you to those involved with my project. To Erik and Anna, thank you for your time, patience and guidance throughout the spring. The motivation you have given me has been instrumental in the enjoyment of the project. A special thank you to my colleagues, Jesper and Michaela, for all the help and discussions you have provided. Lastly, I would like to thank family and close friends for their encouragement and support.

Erik Aerts, Gothenburg, June 2024

List of Acronyms

Below is the list of acronyms and classes for each antibiotic used in the thesis, listed in alphabetical order:

Acronym  Name                             Class
AMC      Amoxicillin + clavulanic acid    Penicillins
AMP      Ampicillin                       Penicillins
AZT      Aztreonam                        Monobactams
CZL      Cefazolin                        Cephalosporins
CPM      Cefepime                         Cephalosporins
CTX      Cefotaxime                       Cephalosporins
CXT      Cefoxitin                        Cephalosporins
CPD      Cefpodoxime                      Cephalosporins
CTZ      Ceftazidime                      Cephalosporins
CTR      Ceftriaxone                      Cephalosporins
CMP      Chloramphenicol                  Others
CIP      Ciprofloxacin                    Quinolones
ETP      Ertapenem                        Carbapenems
GEN      Gentamicin                       Aminoglycosides
IMI      Imipenem                         Carbapenems
LVX      Levofloxacin                     Quinolones
MEM      Meropenem                        Carbapenems
STR      Streptomycin                     Aminoglycosides
SXT      Sulfamethoxazole                 Sulfonamides
INN      Sulfisoxazole                    Sulfonamides
TET      Tetracycline                     Tetracyclines
TOB      Tobramycin                       Aminoglycosides
TRI      Trimethoprim                     Others
TMP      Trimethoprim + sulfamethoxazole  Aminoglycosides

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Aim and Scope
2 Background
3 Theory
  3.1 Tokenization & Embedding space
  3.2 Self-attention
  3.3 Multi-head attention
  3.4 Encoder
  3.5 Pre-training with MLM
  3.6 Transformer-based binary classification
  3.7 Overfitting & solutions
  3.8 Principal component analysis
  3.9 Evaluation metrics
4 Methods
  4.1 Data selection & pre-processing
  4.2 Vocabulary
  4.3 Masking sequences
  4.4 Building the model
    4.4.1 Encoder
    4.4.2 Parallel classification networks
  4.5 Training
    4.5.1 Early stopping
    4.5.2 Selective freezing
  4.6 PCA analysis of the CLS-tokens
5 Results & Discussion
  5.1 Pre-training
  5.2 Hyperparameters in fine-tuning
    5.2.1 Experimental setup 1
    5.2.2 Experimental setup 2
  5.3 Evaluating performance
  5.4 Analyzing learning patterns
    5.4.1 PC1: ARG quantity
    5.4.2 PC2: ARG groupings
    5.4.3 PC3: pmrB mutations
  5.5 Future work
6 Conclusion
Bibliography
A Appendix 1
List of Figures

3.1 Illustration of a multi-head attention layer as a flow chart. The input X^E is projected to the m self-attention heads for multi-head attention (orange). Each of the m heads follows the self-attention principle (grey), where the weight matrices used are pointed to the related block. The output from each head is concatenated (yellow) and linearly transformed (blue) to produce the resulting output Ŷ^A from the multi-head attention layer.

4.1 Frequency distribution and R/S percentages for each studied antibiotic. The order of the antibiotics is consistent between subfigures.

4.2 Illustration of the masked sequencing process. The three steps of the process are listed to the left. The tokens of the sequence are illustrated as labelled boxes with additional type-specific colours: CLS-token (yellow), ARG-token (green), PAD-token (blue) and MASK-token (red). Additional information on the position of the tokens is presented in red numbers above the respective steps of the process.

4.3 Illustration of the model presented as a flow chart. The model is divided into two parts: encoder (blue side) and parallel classification networks (yellow side). The tokenized gene sequence is shown as boxes labeled Ti and embeddings are shown as boxes labeled Ei. The embedded output of the CLS-token is fed to the parallel networks, resulting in separate R/S predictions.

5.1 Training and validation loss during pre-training for each model. The names of the models are constructed with the complexity of the model denoted by a number (1 (blue), 3 (pink) or 12 (black)) followed by the masking difficulty denoted by a letter (E for easy, H for hard). Labels are consistent between subfigures.

5.2 Bar diagram of the recorded accuracies for each model tested in experiment 1. The colour saturation of the bars indicates the complexity of the respective model for each masking percentage, labelled to the right.

5.3 Bar diagram of the recorded accuracies for each model tested in experiment 2. The colour saturation of the bars indicates the embedding size used for each model complexity, labelled to the left.

5.5 Barplots with sensitivity data, grouped based on antibiotic classes.
Each bar is coloured based on the type of run that was performed: no pre-training (green), easy pre-training (blue) and hard pre-training (purple). The saturation of the colours indicates the complexity of the model: 1 encoder block (light), 3 encoder blocks (medium) and 12 encoder blocks (dark). Each subplot is labelled with the antibiotic class present and whether the fine-tuning was done with an easy or hard masking difficulty.

5.6 Scatter plot of PC1 & PC2 for each CLS-token from the tested models. Each CLS-token in all subplots has been coloured depending on the number of ARGs in the respective sequence. Colour palettes are consistent between subfigures. Each subplot is captioned with the settings used in training to produce the model.

5.7 Scatter plot of PC1 & PC2 for each CLS-token from the tested models. Each CLS-token in all subplots has been coloured depending on the ARG content of the sequence. Colour palettes are consistent between subfigures. Isolates with none of the studied point mutations (purple), isolates with gyrA-D87N and parC-S80I (blue), isolates with aph(3'')-Ib and aph(6')-Id (green) and isolates with all studied point mutations (yellow) are marked accordingly. Each subplot is captioned with the settings used in training to produce the model.

5.8 Scatter plot of PC1 & PC3 for each CLS-token from the tested models. Each CLS-token in all subplots has been coloured depending on the ARG content of the sequence. Isolates with none of the studied point mutations (beige), isolates with pmrB-Y38N (aqua), isolates with pmrB-E123D (navy) and isolates with all studied point mutations (black) are marked accordingly. Colour palettes are consistent between subfigures. Each subplot is captioned with the settings used in training to produce the model.

A.1 Barplots with specificity data, grouped based on antibiotic classes. Each bar is coloured based on the type of run that was performed: no pre-training (green), easy pre-training (blue) and hard pre-training (purple). The saturation of the colours indicates the complexity of the model: 1 encoder block (light), 3 encoder blocks (medium) and 12 encoder blocks (dark). Each subplot is labelled with the antibiotic class present and whether the fine-tuning was done with an easy or hard masking difficulty.

A.2 Scree plots for the PCA-CLS analysis for all selected models. Each subplot is captioned with the settings used in training to produce the model.

List of Tables

5.1 Overview of the pre-training including model name, parameters, training time and resulting accuracy for each model. The models are named with a number corresponding to the encoder blocks followed by a letter indicating if the training was done with an easy (E) or hard (H) masking percentage.

5.2 The five most common ARGs in the Top and Bot groups of points of interest for all tested models in PC1. The number of occurrences for each ARG is presented in parentheses after the ARG.

5.3 The five most common ARGs in the Top and Bot groups of points of interest for all tested models in PC2. The number of occurrences for each ARG is presented in parentheses after the ARG.

5.4 The five most common ARGs in the Top and Bot groups of points of interest for all tested models in PC3. The number of occurrences for each ARG is presented in parentheses after the ARG.

A.1 Recorded sensitivity for each model and antibiotic in the study for easy fine-tuning (E). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used. The type of pre-training was either easy (E), hard (H) or no pre-training (N).

A.2 Recorded sensitivity for each model and antibiotic in the study for hard fine-tuning (H). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used. The type of pre-training was either easy (E), hard (H) or no pre-training (N).

A.3 Recorded specificity for each model and antibiotic in the study for easy fine-tuning (E). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used. The type of pre-training was either easy (E), hard (H) or no pre-training (N).

A.4 Recorded specificity for each model and antibiotic in the study for hard fine-tuning (H). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used. The type of pre-training was either easy (E), hard (H) or no pre-training (N).
1 Introduction

Since the start of the 20th century, human life expectancy has seen a rapid increase. In fact, between 1900 and 2021, the expected lifespan of a newborn increased by 122%: from 32 years to 71 years [1]. One of many important factors in this positive change has been healthcare's ability to consistently treat common diseases associated with bacterial infections with the use of antibiotic substances [2]. After the discovery by Sir Alexander Fleming in 1928, the first antibiotic substance became commercially available on the market in 1945. This marked the start of the golden era of the discovery of novel antibiotics, with the last new class of antibiotics, at the current time of writing, having been discovered in the 1970s [3]. Ever since, the demand for antibiotic substances has continued to grow to maintain the continuous growth in health standards. As a reference, the global consumption of antibiotics has increased by 48% since the start of the millennium [4].

However, with this abundance of usage, a global increase in bacterial antibiotic resistance (AR) has been observed ever since the substances were introduced as a treatment option. The first documented signs of antibiotic resistance date back to 1942, when strains of Staphylococcus aureus were found to resist the action of penicillin in hospitalized patients during trials. The rate of resistance continued to grow, yet this was not deemed problematic during the golden era of the discovery of novel antibiotics, since any discovered resistance was combated with a newly developed class of antibiotics. However, the development of new classes of antibiotics has not continued to limit the growth of AR [5]. Bacterial pathogens with the ability to resist conventional treatment alternatives challenge many of the advancements made in modern medicine [6].
It was estimated that AR contributed to 4.95 million deaths worldwide in 2019 and was deemed the core reason for 1.27 million of those deaths [7]. Moreover, antibiotic substances play an important role in a variety of medical treatments and procedures not directly associated with combating an ongoing infection, including areas such as surgery, cesarean sections and cancer chemotherapy. This increases the vulnerability of the patient when undergoing these procedures, as post-treatment infections may be as difficult to treat as the reason for the procedure itself [6]. In addition to the burden AR places on the healthcare system, the World Bank estimates that AR could result in $1 trillion in additional healthcare costs by 2050 and between $1 and 3.4 trillion in GDP losses per year by 2030 [8].

Bacteria can acquire AR by gaining antibiotic resistance genes (ARGs), which encode the resistance mechanisms needed to withstand the presence of the substances. This includes, but is not limited to, changed permeability in the outer membrane, developed efflux pumps that expel specific substances, and enzymatic modification or degradation. Importantly, these resistance mechanisms do not have to be substance-specific but can change the susceptibility of a pathogen to a wide range of antibiotics [9, 10]. ARGs can be obtained through mutations in an individual genome or by receiving transferred genetic material, such as plasmids, from the surroundings [11]. These mobile genetic elements can contain a large series of ARGs, creating a possible multi-resistant profile against different classes of antibiotics. This complicates the decision of which antibiotic substance to use during treatment, as the resistance profile of the bacteria in question is usually not known to the physician, leading to an increase in antibiotic usage to find a treatment alternative. This may lead to a vicious cycle, further decreasing our ability to treat what has recently been treatable [12].

The resistance profile of bacteria can be determined by cultivation in the presence of different antibiotics to determine the degree of susceptibility. This methodology produces well-interpretable results but requires time and laboratory precision to avoid contaminating the tests [13]. Alternatively, the genetic information of the bacteria can be studied to find the presence of ARGs and mutations commonly associated with antibiotic resistance. PCR can perform this task while being a fast and cheap alternative, but it is limited by the number of genes it can test per run, making the full resistance profile difficult to find [14]. Furthermore, sequencing the entire genome can give the resistance profile of the individual but is a more time-consuming and expensive method [15].

Deep learning is a field within artificial intelligence (AI) that has seen a recent increase in public attention after many open-source projects have been made available. Understanding and communicating with human languages, recognizing and analyzing features in images, and creating media from scratch are a few examples of tasks previously not associated with machines. This has been made possible by recent advancements in the field, such as the introduction of transformers. The ability of these networks to understand complex patterns has raised the question of whether, and how, AI can be integrated into our healthcare system.
Networks aimed at assisting in diagnosis, suggesting treatment plans and guiding physical robots in surgical procedures are all examples of AI in healthcare with promising results [16].

1.1 Aim and Scope

The thesis aims to create an AI model framework using state-of-the-art deep learning techniques to predict resistance to commonly used antibiotics from genetic data. This framework is intended to contribute knowledge on important factors when integrating transformers into the field of AI-guided antibiotic resistance prediction. This includes architectural aspects of how performance is related to model complexity and embedding size, as well as how masked language training with genomic data affects the ability to make accurate predictions. Moreover, the thesis aims to investigate whether self-supervised pre-training can improve performance, and if so, how to make efficient use of the available data. Furthermore, the thesis aims to study the decision processes of different models to understand the data patterns the models have found. With this, discovered features and relations in the genomic data aim to increase the knowledge of underlying factors of antibiotic resistance.

The thesis is limited to genomic data from the species Escherichia coli; data from other species are excluded. Furthermore, the thesis limits the number of studied antibiotics to 24, meaning that all models will only predict resistance to the antibiotics included in the study. Since the thesis aims to create a framework for integrating transformer-based models into the field of antibiotic resistance prediction, no other types of deep learning architectures are studied.

2 Background

Artificial intelligence (AI) is a field within computer science that uses mathematical modelling to learn structural patterns in data without guidance or interference from an external source. The experience the model has gained from training on a data set can be used to influence decisions the model makes when presented with new, unseen information. The subject of AI and machine learning (ML) dates back to 1943, when published work featured a simplistic mathematical model resembling the human neuron, which was able to process boolean information to decide when to activate [17]. Due to restrictions in both hardware and available data, the field of AI saw limited interest from both academia and the public for the remaining decades of the 20th century [18]. With the introduction of graphical processing units (GPUs) and better data storage options around the turn of the millennium, the field of AI saw a resurgence. More complex networks could now be trained on larger sets of data, which began the era of deep learning (DL) [17].

Teaching DL networks can primarily be done with two techniques, depending on the task the network is aimed to learn. When giving a network a large quantity of data without additional information, the network can cluster similar data together from found patterns; this is called unsupervised learning. If the network is aimed at classifying data into a set of predetermined classes, either on a binary or multi-class level, the use of additional labels in the training data can help teach the network which features to associate with certain labels; this is called supervised learning. During training, the network needs some feedback on its performance on the desired task in order to improve.
The use of a loss function to determine a numeric estimate of performance is common in ML. The calculated loss can then be used to determine gradients to be back-propagated through the system to tune the trainable parameters to better fit the data. This parameter optimization is done with an optimization algorithm, commonly stochastic gradient descent or adaptive moment estimation (Adam). During training, the network's performance is continuously validated on a subset of the data that is not available when tuning the parameters. This type of validation is done to ensure that the model is robust enough to process new data and does not overfit on specific patterns found in the training data [19].

The architectural structure of the network depends on the task. In the field of natural language processing (NLP), a sub-field of computer science where the aim is for computers to understand and utilize human language, the data is often structured as position-dependent sequences. Recurrent neural networks (RNNs) were commonly used to solve NLP tasks for many years due to their ability to capture sequential dependencies. However, RNNs were often not robust enough to capture long-range dependencies in sequences and had to be implemented with additional features to overcome the problem of vanishing gradients during back-propagation. Moreover, the recurrent nature of the network made parallel computation over multiple parts of the data impossible, which slowed down training. These two limitations of the RNN motivated the development of a new model architecture not dependent on the recurrence previously used for sequential data: the transformer [20]. The transformer uses a self-attention mechanism to compute weighted representations for each position to understand contextual relationships in a sequence. The model is also not dependent on the positions in a sequence in the way the RNN architectures are. Instead, the full sequence is fed as input to the network, and additional information such as position and order is saved as separate encodings. This means that the transformer model is not limited by the same parallelization constraints as RNN networks [21].

When training a transformer-based model for an NLP task, it is common to use masked language modelling (MLM), a self-supervised pre-training technique, before adapting the network to a specific aim. This self-supervised learning is performed by hiding away a subset of the data to create self-imposed multi-class classification problems [22]. This is done by masking out parts of the sequences and then asking the network what was hidden from it. The aim is to give the network a better understanding of the language as a whole before the model is adapted to a specific task via fine-tuning. The masking of the data can be changed for each iteration over the entire data set, denoted an epoch, while keeping a similar percentage of masked words, or the masking can remain constant during the entire training phase. This difference in masking techniques is called dynamic or static masking, respectively [23]. Given the state-of-the-art performance in many areas of NLP, the idea of applying transformer-based models and the self-attention mechanism in other areas was not far off [24].
Transformers have been applied in computer vision to perform tasks such as image recognition, semantic segmentation and object detection, and have shown outstanding, sometimes state-of-the-art, performance [25]. In a similar fashion, owing to the sequential nature of the data, transformers were introduced to perform tasks within the fields of chemistry and biology. This has seen a variety of successful applications, such as Chemformer, which performs sequence-to-sequence and discriminative cheminformatics tasks, and AlphaFold, which can predict 3D structures of proteins from amino acid sequences [24, 26].

3 Theory

The following chapter provides the theoretical aspects needed to understand the methodology and results of the thesis.

3.1 Tokenization & Embedding space

For the model to be able to understand a sequence of ARGs, the input sequence is transformed from a series of letter representations to a vectorized format. The first step in this process is called tokenization of the data, where each ARG in the input sequence of length n is represented as a token from a vocabulary created from the N unique elements in the data set. Each of the n elements in the sequence is converted to a one-hot encoding vector of length N, where the position of the given token in the vocabulary is set to 1 and every other position is set to 0. With this, all one-hot encoding vectors are concatenated with a sequential start vector, creating a one-hot encoding matrix M ∈ R^{N×(n+1)}, which can be expressed as follows [2]:

M = \begin{pmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \Bigg\Vert \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{N \times (n+1)}

where ∥ denotes the concatenation of the sequential start vector with the one-hot encoding vectors of the n tokens.

The sequential start vector is regarded as a special token in the system, often denoted the CLS-token. The CLS-token marks the start of a sequence and is used for classification tasks [27]. The one-hot encoding matrix is then transformed into an embedding space. Each vector in the one-hot encoding matrix is mapped to a vector of length d_E, which is the predetermined dimension of the embedding space. Unlike the fixed nature of the one-hot encoding matrix, the embedding parameters are updated during training with the goal of representing the contextual meaning of the tokens in the sequence [2].

3.2 Self-attention

Self-attention uses attention functions to calculate weighted averages to gain contextual information from the token embeddings. The attention function computes an output sequence Y^A = (y_1, ..., y_{n+1}) of the same length n+1 as the embedded input sequence X^E = (x_1, ..., x_{n+1}) by using a set of linear transformations to map the two. The linear transformation is based on the matrix Q containing the n+1 query vectors specific to each token, the matrix K containing the n+1 key vectors used as the contextual reference points, and the matrix V used to compute the desired weighted average of the output. The matrices Q, K and V can be calculated as in Equation 3.1,

Q = W^Q X^E, \quad K = W^K X^E, \quad V = W^V X^E     (3.1)

where W^Q, W^K and W^V are weight matrices with trainable parameters. The output sequence Y^A from the attention function can be determined as in Equation 3.2,

Y^A = h_{sm}\!\left(\frac{Q K^T}{\sqrt{d}}\right) V     (3.2)

where \sqrt{d} is a scaling factor dependent on the dimension of the weight matrices W^Q, W^K and W^V, introduced to prevent vanishing gradients in the system [28]. Equation 3.2 contains the activation function h_{sm}, which is a softmax operation on the scaled inner product between Q and K^T. The softmax activation function is defined in Equation 3.3,

h_{sm}(p_i) = \frac{e^{p_i}}{\sum_{k=1}^{m} e^{p_k}}     (3.3)

for a given set p_i, i ∈ [1, ..., m] [2].
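The thesis does not include an implementation, but the attention computation in Equations 3.1–3.3 can be sketched directly. The following is a minimal PyTorch illustration, not the thesis code; the function name and the toy dimensions are chosen here for demonstration, and rows (rather than columns) are used for tokens.

```python
import torch

def self_attention(x_e: torch.Tensor, w_q: torch.Tensor,
                   w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """Single-head self-attention following Eq. 3.1-3.3.

    Note: rows are tokens here, i.e. the transposed convention compared
    with the column-oriented matrices in the text.
    x_e: embedded input sequence, shape (n + 1, d_E)
    w_q, w_k, w_v: trainable weight matrices, shape (d, d_E)
    """
    # Eq. 3.1: linear projections to queries, keys and values.
    q, k, v = x_e @ w_q.T, x_e @ w_k.T, x_e @ w_v.T    # each (n + 1, d)
    # Eq. 3.2: scaled inner products, softmax over the key positions (Eq. 3.3),
    # then a weighted average of the value vectors.
    scores = q @ k.T / q.shape[-1] ** 0.5              # (n + 1, n + 1)
    return torch.softmax(scores, dim=-1) @ v           # (n + 1, d)

# Toy usage: 5 tokens embedded in 16 dimensions, with d = d_E = 16.
x = torch.randn(5, 16)
w_q, w_k, w_v = torch.randn(16, 16), torch.randn(16, 16), torch.randn(16, 16)
print(self_attention(x, w_q, w_k, w_v).shape)          # torch.Size([5, 16])
```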
3.3 Multi-head attention

When using self-attention to learn contextual information in a transformer model, a common approach is to linearly project the queries, keys and values m different times in so-called attention heads. The multi-head attention approach can project the input into multiple different subspaces simultaneously and can therefore compute multiple attention functions independently of one another to get a better understanding of the dependencies in the data. The heads work as parallel self-attention layers on the same embedded input, each generating an updated attention embedding Y^A_i, i ∈ [1, ..., m], which is concatenated into a collective output Y^{A*} ∈ R^{Nm×(n+1)} from the parallel layers [29].

The dimension of the concatenated output Y^{A*} is reduced through a linear transformation with a weight matrix W^O with trainable parameters to produce the output of the multi-head attention layer, Ŷ^A ∈ R^{N×(n+1)}. The dimensions of the output after the linear transformation with W^O equal those of the embedded input. This dimensional consistency is necessary for information to be passed through multilayered networks, see Equation 3.4 [29],

\hat{Y}^A = W^O Y^{A*} = W^O \begin{pmatrix} Y^A_1 \\ Y^A_2 \\ \vdots \\ Y^A_{m-1} \\ Y^A_m \end{pmatrix}     (3.4)

A graphical illustration of the multi-head attention process is presented in Figure 3.1.

Figure 3.1: Illustration of a multi-head attention layer as a flow chart. The input X^E is projected to the m self-attention heads for multi-head attention (orange). Each of the m heads follows the self-attention principle (grey), where the weight matrices used are pointed to the related block. The output from each head is concatenated (yellow) and linearly transformed (blue) to produce the resulting output Ŷ^A from the multi-head attention layer.

3.4 Encoder

The encoder of a transformer is built as a series of encoder blocks. Each encoder block starts with a multi-head attention layer using a predefined number of heads m, followed by an Add&Normalize (A&N) layer. The A&N layer sums the input embeddings X^E with the output from the multi-head attention layer Ŷ^A and normalizes the sum of the embedding matrices, see Equation 3.5 [30],

\bar{Y}^A = \hat{Y}^A + X^E, \qquad h_{norm}(\bar{Y}^A) = \frac{\bar{Y}^A - \mu_l}{\sqrt{\sigma_l^2}}     (3.5)

where Ȳ^A is the intermediate residual connection, and μ_l and σ_l² are the mean and variance of the summed matrices, respectively. Following the A&N layer is a position-wise feed-forward layer (FFN), which is a fully connected neural network applied to each position. The FFN consists of two linear transformations with a rectified linear unit (ReLU) activation function connecting them, see Equation 3.6,

h_{FFN}(\bar{Y}^A) = \max(0, W_1 \bar{Y}^A + b_1) W_2 + b_2     (3.6)

where W_1 and W_2 are weight matrices and b_1 and b_2 are bias vectors. Finally, another A&N layer is added after the FFN layer, completing an encoder block. Importantly, the final output of the encoder block Ȳ^A shares its dimensional properties with the original input X^E, meaning that the output from one encoder block can be fed into another encoder block. By stacking encoder blocks, the encoder becomes progressively more complex in terms of the number of trainable parameters available to learn the task at hand. Results have shown that increasing the model complexity has led to better performance on large tasks such as language modelling and machine translation, but it can also improve results on smaller-scale tasks if trained properly [30].
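The structure of one encoder block (Sections 3.3–3.4) can be sketched as follows. This is not the thesis implementation; it is a hedged PyTorch sketch in which the class name `EncoderBlock` and the use of `torch.nn.MultiheadAttention` and `torch.nn.LayerNorm` are implementation choices made here. The default values (embedding size 512, 4 heads, 20% dropout, FFN width equal to the embedding size) follow the settings reported later in Section 4.4.1.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """One encoder block: multi-head attention -> Add&Norm -> FFN -> Add&Norm."""

    def __init__(self, d_e: int = 512, n_heads: int = 4, dropout: float = 0.2):
        super().__init__()
        # Multi-head attention over the embedded sequence (Section 3.3).
        self.attn = nn.MultiheadAttention(d_e, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm1 = nn.LayerNorm(d_e)
        # Position-wise feed-forward network (Eq. 3.6): two linear layers
        # with a ReLU in between, hidden width equal to the embedding size.
        self.ffn = nn.Sequential(nn.Linear(d_e, d_e), nn.ReLU(),
                                 nn.Linear(d_e, d_e))
        self.norm2 = nn.LayerNorm(d_e)

    def forward(self, x, pad_mask=None):
        # x: (batch, n + 1, d_E); pad_mask: True where a PAD position is ignored.
        attn_out, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = self.norm1(x + attn_out)      # Add&Normalize (Eq. 3.5)
        x = self.norm2(x + self.ffn(x))   # Add&Normalize after the FFN
        return x

# Stacking blocks increases the model complexity; e.g. 12 encoder blocks:
encoder = nn.Sequential(*[EncoderBlock() for _ in range(12)])
out = encoder(torch.randn(2, 51, 512))    # (batch, sequence length, d_E)
print(out.shape)                          # torch.Size([2, 51, 512])
```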
3.5 Pre-training with MLM

Pre-training a transformer model using the MLM technique lets the model tune its trainable parameters to general patterns in unlabeled data before future fine-tuning to perform a specific task requiring label-based guidance, such as classification. Given a data set B = {X_1, ..., X_{N_S}} of N_S sequences, a subset of the tokens in each sequence X_i is probabilistically changed with probability p_m. The selected tokens have a probability of being changed to a MASK-token or to a randomly selected token from the vocabulary [30]. For a given X_i ∈ B, the masked tokens that were randomly selected are denoted as the subset n_u, with one-hot encoding vectors x^u_i, i = 1, ..., n_u. The MLM pre-training is done by minimizing the cross-entropy loss between the selected columns of the output from the last encoder block Ȳ^A and the one-hot encoding vectors x^u_i. The columns selected from Ȳ^A are those corresponding to the tokens masked in the subset n_u, see Equation 3.7,

\arg\min_{\phi} \sum_{i=1}^{n_u} -\log\!\left(\frac{e^{\bar{y}^A_i}}{\sum_{N} e^{\bar{y}^A_i}}\right) x^u_i     (3.7)

where ȳ^A_i is the i:th column of Ȳ^A and N is the size of the vocabulary. ϕ represents all trainable parameters in the model [2].
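To make the MLM objective in Equation 3.7 concrete, the fragment below shows how the cross-entropy loss can be restricted to the masked positions only. It is an illustrative PyTorch sketch, not the thesis implementation; the tensor names, shapes and the boolean mask convention are assumptions made for this example.

```python
import torch
from torch import nn

vocab_size, d_e = 1231, 512            # vocabulary size and embedding size used in the thesis
to_vocab = nn.Linear(d_e, vocab_size)  # final linear layer predicting ARG tokens

def mlm_loss(encoder_out: torch.Tensor, target_ids: torch.Tensor,
             masked: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the masked positions only (Eq. 3.7).

    encoder_out: (batch, n + 1, d_E) output of the last encoder block
    target_ids:  (batch, n + 1) original token ids before masking
    masked:      (batch, n + 1) boolean, True where a token was selected for masking
    """
    logits = to_vocab(encoder_out)      # (batch, n + 1, vocab_size)
    # Keep only the positions that were masked; the rest do not contribute.
    return nn.functional.cross_entropy(logits[masked], target_ids[masked])

# Toy usage with random tensors standing in for real encoder output.
enc = torch.randn(4, 51, d_e)
tgt = torch.randint(0, vocab_size, (4, 51))
msk = torch.rand(4, 51) < 0.3           # roughly the 30% "easy" masking level
print(mlm_loss(enc, tgt, msk))
```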
3.6 Transformer-based binary classification

Using the output from a transformer-based model for classification tasks often requires additional network structures added to the model. With this, the transformer and the classification networks are trained in unity to solve the desired task. The combined model is then trained using supervised learning on labelled data, with the aim of accurately assigning a binary label to an input sequence given to the transformer. Commonly, the classification networks are only given the CLS-token, or more specifically the first column of the output matrix Ȳ^A, from the encoder to base the classification decision on. This is done for computational efficiency, as the dimension of the input to the classification networks would otherwise be overwhelming, depending on the embedding size and sequence length. The CLS-token is therefore trained to be an aggregated representation of the contextual information needed from the sequence for classification [2]. Training the combined model aims to minimize the binary cross-entropy loss between the predicted class x_pred from the classification networks and the true class x_true, formulated in Equation 3.8,

\arg\min_{\Phi} \sum_{i=1}^{N_P} -\big[x_{true,i}\log(p(x_{pred,i})) + (1 - x_{true,i})\log(1 - p(x_{pred,i}))\big]     (3.8)

where p(x_pred,i) is the probability of the model assigning a positive class to the i:th prediction and N_P is the total number of predictions made by the model. Φ represents all trainable parameters in the combined model [2, 19].

3.7 Overfitting & solutions

A problem often encountered within the field of ML is that the model learns specific patterns in the training data too well, which leads to the model not being able to perform on unseen data, so-called overfitting [32]. A multitude of solutions has been presented to combat overfitting, one being the dropout technique. Dropout refers to temporarily removing units, alongside the ingoing and outgoing signals from said units, from the network during training. This selection is done randomly with a predetermined probability p_d [33].

3.8 Principal component analysis

Principal component analysis (PCA) is a linear dimensionality reduction technique that aims to describe the variance in a system of inter-correlated dependent variables. Given a data set B with p features for n individuals, a set of n feature vectors b_i ∈ B, each of dimension p, can be defined. The feature vectors exist in a p-dimensional feature space, and the PCA method can be interpreted as a rotation of the coordinate axes in feature space such that the primary component is aligned with the maximum possible data dispersion. The secondary PC-component is then aligned with the second-most data dispersion, and so on, until all p PCA components have been aligned. With this, the data can be analysed in a lower-dimensional setting by studying a subset of the PC-components describing the important variance in the system. The percentage of variance each PC-component describes can be visualized in a Scree plot [34, 35].

3.9 Evaluation metrics

A common approach to evaluating a model trained with supervised learning is to measure the accuracy of the predictions, i.e. the proportion of correctly assigned labels out of the total number of predictions made by the model. However, this approach is not always the best fit for estimating how well the model is performing. Sensitivity and specificity are two metrics describing the accuracy in detecting the presence or absence of a given characteristic. Both metrics are calculated by counting instances of predictions deemed true, when the prediction agrees with the actual outcome, or false, when the prediction is the opposite of the actual outcome. With a binary set of classes labelled positive and negative, sensitivity measures the ability to correctly capture the positive class and specificity the negative class. The metrics are defined in Equation 3.9a and Equation 3.9b [36].

\text{Sensitivity} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}     (3.9a)

\text{Specificity} = \frac{\text{True negative}}{\text{True negative} + \text{False positive}}     (3.9b)
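The metrics in Equations 3.9a–3.9b are straightforward to compute from binary predictions. The short function below is a generic sketch, not thesis code, that could be applied per antibiotic; the label convention (1 = resistant, 0 = susceptible) is an assumption for the example.

```python
import numpy as np

def sensitivity_specificity(y_true: np.ndarray, y_pred: np.ndarray):
    """Return (sensitivity, specificity); 1 = resistant (R), 0 = susceptible (S)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")  # Eq. 3.9a
    spec = tn / (tn + fp) if (tn + fp) else float("nan")  # Eq. 3.9b
    return sens, spec

# Example: always predicting the majority (susceptible) class can give high
# accuracy yet zero sensitivity on an antibiotic where resistance is rare.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_pred = np.zeros_like(y_true)
print(sensitivity_specificity(y_true, y_pred))  # (0.0, 1.0)
```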
4 Methods

The following chapter presents the methodology regarding data processing, model architecture and the experimental designs of the thesis.

4.1 Data selection & pre-processing

Data were provided by the National Center for Biotechnology Information (NCBI). The data set was published on the 31st of March and contained 370,673 samples of information and analysis results from genome sequencing of Escherichia coli. Of those, 364,820 samples (98.42%) contained core genotype information used for pre-training and 7,701 samples (2.12%) contained both core genotype and antibiotic phenotype information used for fine-tuning. Complementing the genotype and phenotype information, the recorded location and year of collection were added to the data set.

During pre-processing, all samples missing core genotype information were removed from the data set. Moreover, an inclusion threshold was set at the year 1970, as previously recorded data were deemed unreliably collected for the study. If a gene was recorded in an incomplete way, the specific gene in question was removed. With this, the data set resulted in 364,462 samples with core genotype information and 6,486 samples with both core genotype and phenotype information.

For the phenotype information, only recorded instances of resistant (R) or susceptible (S) for antibiotics were kept in the data set. Furthermore, to avoid creating a heavily unbalanced data set, each antibiotic was analyzed separately, where its frequency dictated an accepted percentage threshold for the recorded instances of R and S. Three different groups were formed. Group 1 contained all antibiotics observed in the data set 100 times or fewer. Group 2 contained all antibiotics observed between 100 and 1,000 times in the data set. Group 3 contained all antibiotics observed more than 1,000 times in the data set. The percentage threshold was set to 65/35 for Group 2 and 92/8 for Group 3, no matter which of the R or S labels was in the majority. All antibiotics in the respective groups that complied with the given thresholds were kept; the rest were removed from the data set. No threshold was set for Group 1, as the antibiotics in this group were deemed too uncommon and were subsequently removed. Importantly, three antibiotics within the class carbapenems were kept in the data set even though they did not meet the required thresholds. With this, the frequency distribution and R/S percentages for each of the 24 antibiotics included in the study are presented in Figure 4.1a and Figure 4.1b, respectively.

Figure 4.1: Frequency distribution (a) and R/S percentages (b) for each studied antibiotic. The order of the antibiotics is consistent between subfigures.

4.2 Vocabulary

A vocabulary was built using all possible tokens that could be fed to the encoder. This includes every unique instance of the different genes, locations and years recorded in the data set. In addition, four special tokens were included in the vocabulary: CLS, PAD, MASK and UNK, added manually as the first elements of the vocabulary. The vocabulary used in the project contained a total of 1,231 tokens.

4.3 Masking sequences

Using the pre-processed data, each recorded sample was transformed into a data sequence. Each sequence starts with a CLS-token, followed by the year and location of the observation and thereafter the ARG-tokens. To homogenize the sequence lengths in the data set, each sequence was extended to a predetermined length. This was set to 51 to match the longest recorded sequence. Each sequence that did not reach the maximum length was extended by adding PAD-tokens at the end of the sequence. Moreover, if a feature was missing for an observation, a PAD-token was added in place of the missing information. These PAD-tokens were used as information fillers to maintain consistency between sequences. They did not influence training, as the models were instructed to only apply self-attention to the non-PAD-tokens in the sequence.

The masking process randomly selects a subset of the tokens in each sequence according to a predetermined masking probability, with the condition that at least one token in each sequence is selected. Moreover, no tokens of type PAD were selected. Each selected token has an 80% chance of being replaced by a MASK-token, a 10% chance of being replaced by a randomly selected token from the created vocabulary and a 10% chance of remaining the same. The indices and original tokens were saved for training purposes. A visualisation of the masking process can be found in Figure 4.2.

Figure 4.2: Illustration of the masked sequencing process. The three steps of the process are listed to the left. The tokens of the sequence are illustrated as labelled boxes with additional type-specific colours: CLS-token (yellow), ARG-token (green), PAD-token (blue) and MASK-token (red). Additional information on the position of the tokens is presented in red numbers above the respective steps of the process.
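A minimal sketch of the masking step described in Section 4.3 is given below. The special-token ids (CLS=0, PAD=1, MASK=2, UNK=3) are placeholders, since the actual ids in the thesis vocabulary are not stated, and whether the CLS-token could be selected is not specified; it is excluded here.

```python
import random

CLS, PAD, MASK, UNK = 0, 1, 2, 3     # placeholder ids for the four special tokens
VOCAB_SIZE = 1231                    # total vocabulary size used in the thesis

def mask_sequence(tokens, p_mask=0.3, rng=None):
    """Return (masked_tokens, targets); targets is -1 at unselected positions.

    Mirrors Section 4.3: non-PAD tokens are selected with probability p_mask
    (at least one per sequence); a selected token is replaced by MASK with
    80% probability, by a random vocabulary token with 10% probability, and
    is left unchanged with 10% probability.
    """
    rng = rng or random.Random(0)
    candidates = [i for i, t in enumerate(tokens) if t not in (PAD, CLS)]
    selected = [i for i in candidates if rng.random() < p_mask] or [rng.choice(candidates)]

    masked, targets = list(tokens), [-1] * len(tokens)
    for i in selected:
        targets[i] = tokens[i]                        # remember the original token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK
        elif roll < 0.9:
            masked[i] = rng.randrange(4, VOCAB_SIZE)  # random non-special token
        # else: keep the original token unchanged
    return masked, targets

# Toy sequence: CLS, then year/location/ARG ids, padded to a fixed length.
print(mask_sequence([CLS, 57, 102, 845, 13, PAD, PAD]))
```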
4.4 Building the model

The model framework for the thesis consisted of two parts: an encoder and a series of parallel linear classification networks, presented in Figure 4.3. The following sections describe the two parts in depth.

Figure 4.3: Illustration of the model presented as a flow chart. The model is divided into two parts: encoder (blue side) and parallel classification networks (yellow side). The tokenized gene sequence is shown as boxes labeled Ti and embeddings are shown as boxes labeled Ei. The embedded output of the CLS-token is fed to the parallel networks, resulting in separate R/S predictions.

4.4.1 Encoder

The encoder was built with a set of encoder blocks, described in Chapter 3.4, where the multi-head attention layer consisted of 4 attention heads and had a dropout probability p_d of 20% for all encoder blocks. Each feed-forward layer in the encoder blocks used the embedding dimension as the dimension of the fully connected layers. An additional linear layer was implemented after the last encoder block, with the number of output neurons depending on the vocabulary size. This layer made the ARG-token predictions used during pre-training.

4.4.2 Parallel classification networks

A feed-forward network was created for each unique type of antibiotic, 24 in total. These networks were placed in parallel to one another and receive the same embedded CLS-token output from the encoder to base the binary classification decision on. Each network was built as a fully connected layer with the same dimension as the token embeddings in the system, followed by a linear layer producing a single output as the prediction. Moreover, a ReLU activation function and a layer normalization were implemented between the two layers.
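A hedged sketch of how the two parts in Section 4.4 can be assembled is shown below. It is not the thesis code: the class name is invented for this illustration, `torch.nn.TransformerEncoder` is used as a stand-in for the encoder blocks described above, positional information is omitted, and the PAD id is assumed to be 1.

```python
import torch
from torch import nn

class ResistancePredictor(nn.Module):
    """Encoder plus 24 parallel classification heads, one per antibiotic (Section 4.4)."""

    def __init__(self, vocab_size=1231, d_e=512, n_blocks=12,
                 n_heads=4, n_antibiotics=24, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_e, padding_idx=1)   # PAD assumed at id 1
        layer = nn.TransformerEncoderLayer(d_model=d_e, nhead=n_heads,
                                           dim_feedforward=d_e, dropout=dropout,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)
        # One small feed-forward head per antibiotic, all reading the CLS embedding.
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(d_e, d_e), nn.ReLU(),
                          nn.LayerNorm(d_e), nn.Linear(d_e, 1))
            for _ in range(n_antibiotics)
        ])

    def forward(self, token_ids, pad_mask=None):
        x = self.encoder(self.embed(token_ids), src_key_padding_mask=pad_mask)
        cls = x[:, 0, :]                                 # embedded CLS-token
        # One logit per antibiotic; a BCE-with-logits loss gives the R/S prediction.
        return torch.cat([head(cls) for head in self.heads], dim=1)

model = ResistancePredictor()
logits = model(torch.randint(0, 1231, (2, 51)))          # batch of 2 sequences of length 51
print(logits.shape)                                      # torch.Size([2, 24])
```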
4.5 Training

Two types of training were performed during the project. Self-supervised pre-training was performed with the MLM technique on data containing core genotype information. Any data containing both core genotype and antibiotic phenotype information were excluded from the pre-training. As the pre-training only aimed to train the encoder to understand the genomic data, the parameters in the parallel networks were not tuned. When fine-tuning, the models were subjected to supervised learning using data that contained labeled genomic and antibiotic phenotype information. Any masking done during fine-tuning aimed to increase the difficulty for the models to predict the binary labels, as the models were not trained to predict the missing ARG-tokens during fine-tuning.

When training the models, both during pre-training and fine-tuning, a set of hyperparameters was fixed. To begin with, the data set was separated into training and validation groups with a split of 80/20. Each model was trained for a maximum of 100 epochs with a learning rate of 10^-7 and a batch size of 32. The loss was calculated with different functions depending on whether the aim was to pre-train or fine-tune: for pre-training, the cross-entropy loss was used, and for fine-tuning, a modified version of the binary cross-entropy loss was used. The network was optimized using a modified version of the adaptive moment estimation (Adam) optimization algorithm with weight decay. The weight decay was set to 10%. All training was done on the Alvis cluster provided by NAISS. The GPUs used in the thesis were NVIDIA Tesla V100 for pre-training and NVIDIA A40 for fine-tuning.

4.5.1 Early stopping

A threshold was implemented to stop the training if early signs of overfitting were detected. Training that continues beyond this point would not benefit the model, which motivated the decision to stop the training to save time and computational resources. This was done by comparing the calculated losses on the validation set between epochs. If the validation loss had not decreased over 7 epochs, the training was terminated and the model with the lowest validation loss was saved.

4.5.2 Selective freezing

Since most isolates in the data set only contained information regarding susceptibility to a subset of the 24 studied antibiotics, a selective freezing method was implemented in training. For a given batch of data, before the calculated loss was used to update the parameters in the network, the indices of the antibiotics present in the batch were checked. This information was subsequently used to freeze the parameters in the parallel networks associated with all antibiotics not present in the respective batch, to avoid them being trained on information not related to them. The selective freezing process was iterative, as all parallel networks were unfrozen after the optimizer had updated the parameters for a given batch.
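The selective-freezing step in Section 4.5.2 can be sketched as below. This is not the thesis implementation; it assumes the illustrative ResistancePredictor sketch above and a label tensor in which missing phenotypes are marked with NaN, which is a convention chosen here for the example.

```python
import torch

def freeze_absent_heads(model, labels):
    """Disable gradients for heads whose antibiotic has no recorded label in the batch.

    labels: (batch, n_antibiotics) with NaN where no R/S phenotype was recorded.
    """
    present = ~torch.isnan(labels).all(dim=0)   # antibiotic has at least one label in the batch
    for head, keep in zip(model.heads, present):
        for p in head.parameters():
            p.requires_grad_(bool(keep))

def unfreeze_all_heads(model):
    for head in model.heads:
        for p in head.parameters():
            p.requires_grad_(True)

# Typical use inside the training loop (loss computed only on observed labels):
#   freeze_absent_heads(model, batch_labels)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
#   unfreeze_all_heads(model)
```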
4.6 PCA analysis of the CLS-tokens

Fine-tuned models were further analysed by isolating the CLS-tokens for each isolate in the data set. The CLS-tokens were isolated by loading a state dict of a trained model and saving each CLS-token from the output of the encoder. In addition, supplementary isolate-specific information, such as the number of genes in the sequence and the presence or absence of specific genes, was concatenated with the data frame of the CLS-tokens for further analysis. A principal component analysis was then performed on the columns of the data frame related to the CLS-tokens. By selecting a subset of the calculated components describing a considerable part of the dispersion, the CLS-tokens could be visualized and studied as scatter plots in two dimensions. With the supplementary information mentioned earlier, colour systems were implemented in the scatter plots to understand how the models cluster and organize CLS-tokens based on the studied features.

5 Results & Discussion

The following chapter presents and discusses the results gathered in the thesis.

5.1 Pre-training

Six models with varying degrees of model complexity and masking difficulty were pre-trained according to the methodology presented in Chapter 4.5. The three complexity levels selected for pre-training were 1, 3 and 12 encoder blocks. Each complexity level was trained with a 30% or 60% random masking percentage, denoted as easy or hard training. All models used an embedding size of 512. None of the models encountered the early stopping condition, and all were trained for the maximum length of 100 epochs. An overview of the training and validation losses for all pre-trained models is presented in Figure 5.1.

Figure 5.1: Training loss (a) and validation loss (b) during pre-training for each model. The names of the models are constructed with the complexity of the model denoted by a number (1 (blue), 3 (pink) or 12 (black)) followed by the masking difficulty denoted by a letter (E for easy, H for hard). Labels are consistent between subfigures.

A descriptive overview of the results from the pre-training is presented in Table 5.1. The presented accuracy describes the percentage of correctly assigned ARG-tokens for the masked tokens in the sequences during the last epoch of training.

Table 5.1: Overview of the pre-training including model name, parameters, training time and resulting accuracy for each model. The models are named with a number corresponding to the encoder blocks followed by a letter indicating if the training was done with an easy (E) or hard (H) masking percentage.

Model name   Parameters    Training time (hours:minutes)   Accuracy
1E           10,455,247    04:50                           0.3942
1H           10,455,247    04:56                           0.2688
3E           28,840,143    11:32                           0.399
3H           28,840,143    11:37                           0.2719
12E          111,572,175   17:20                           0.348
12H          111,572,175   17:20                           0.2593

Observing Table 5.1, the resulting accuracy of the models trained with an easy masking percentage was higher than that of the models trained with a hard masking percentage. This falls in line with the background, as models with access to more information should achieve higher performance on a given task. Notably, an increase in accuracy can be seen with the increase in model complexity from 1 to 3 encoder blocks for both masking percentages. With a further increase in the number of parameters in the network, one would assume that an increase in accuracy would also be observed between 3 and 12 encoder blocks. However, the models with 12 encoder blocks instead saw a decrease in accuracy for both masking percentages. One possible explanation for this observed phenomenon is the shared training length and learning rate between models. If the training were adjusted to an unlimited number of epochs with the early stopping criterion, an accuracy difference dependent on complexity could possibly be found. This is supported by Figure 5.1b, where the validation loss curves for models with 1 and 3 encoder blocks have stagnated more than the validation loss curves for 12 encoder blocks, indicating that more epochs would be needed for the models with 12 encoder blocks. However, this was not tested during the thesis due to constraints in computational resource management and time.

5.2 Hyperparameters in fine-tuning

Models without pre-training were fine-tuned with different hyperparameters to study their influence on model performance. The three hyperparameters tested in the thesis were embedding size, number of encoder blocks and the percentage of masking used in fine-tuning. Two experimental setups were conducted, where two of the selected hyperparameters were varied and the remaining hyperparameter was fixed. The presented accuracy in the experiments describes the percentage of correctly assigned R/S labels for all tested antibiotics in the last epoch of training.

5.2.1 Experimental setup 1

The first experimental setup tested how the effect of model complexity interacted with the masking percentage used in fine-tuning. The models were built with varying numbers of encoder blocks, ranging from 1 to 12 blocks with intermediate levels. All models were tested at three levels of masking percentage: 25% (easy), 50% (medium) and 75% (hard). The embedding size for all models tested in the experiment was set to 256.
The recorded accuracies for all models tested in experiment 1 can be found in Figure 5.2.

Figure 5.2: Bar diagram of the recorded accuracies for each model tested in experiment 1. The colour saturation of the bars indicates the complexity of the respective model for each masking percentage, labelled to the right.

Observing Figure 5.2, a trend of increasing accuracy with increasing model complexity can be seen for all tested levels of masking percentage. The models perform better when using a lower masking percentage in fine-tuning, which falls in line with the theoretical aspect that access to more information results in higher performance. Notably, Figure 5.2 shows the importance of model complexity when limiting the information given to the models: a model with 12 encoder blocks trained on a hard masking percentage performs roughly on par with a model that has 3 encoder blocks but was trained with a medium masking percentage.

5.2.2 Experimental setup 2

The second experimental setup studied how varying embedding sizes influenced the performance for a range of model complexities. The tested embedding sizes were selected from the powers-of-two series, ranging from 32 up to 1024. All embedding sizes were tested on three different model complexities, selected to be 1, 3 and 6 encoder blocks. A fixed masking percentage of 60% was used for all models. The recorded accuracies for all models tested in experiment 2 can be found in Figure 5.3.

Figure 5.3: Bar diagram of the recorded accuracies for each model tested in experiment 2. The colour saturation of the bars indicates the embedding size used for each model complexity, labelled to the left.

Observing Figure 5.3, a trend of increasing accuracy with increasing embedding size can be seen. However, unlike in experiment 1, a stagnation in accuracy can be seen between embedding sizes 512 and 1024 for all complexity levels. This observation motivates the decision to continue using an embedding size of 512 in future experimental setups, as no significant decrease in accuracy can be expected compared to selecting a larger embedding size.

5.3 Evaluating performance

The performance of the models in the experimental setups presented in Chapter 5.2 indicates that the models are capable of learning and understanding genetic data. The models can then utilize that gained knowledge to assign susceptibility labels with a high accuracy, depending on the selected hyperparameters. However, due to the imbalances in the training data seen in Figure 4.1, a model can score a high overall accuracy simply by learning to always guess the label corresponding to the larger fraction for each antibiotic. This motivates the decision to introduce a secondary evaluation metric to study model performance.

A third experiment was conducted to study differences in sensitivity and specificity for each individual antibiotic with different model complexities. The three levels of complexity were set to 1, 3 or 12 encoder blocks. In addition, each complexity level was tested with two different masking percentages, set to 30% (easy) or 60% (hard). Moreover, the influence of pre-training was also studied by investigating differences in sensitivity and specificity between fine-tuned models that received easy or hard pre-training and models that were fine-tuned without pre-training.
In total, the sensitivity and specificity for all tested antibiotics for the 18 different models can be found in Tables A.1, A.2, A.3 and A.4 in Appendix 1. The results from these tables were grouped based on the antibiotic class each antibiotic belongs to and plotted as bar diagrams. The sensitivity bar diagrams for each class are presented in Figure 5.5 and the specificity bar diagrams for each class are presented in Figure A.1 in Appendix 1.

(a) Penic.&Monob., Easy. (b) Carbapenems, Easy. (c) Cephalosporins, Easy. (d) Penic.&Monob., Hard. (e) Carbapenems, Hard. (f) Cephalosporins, Hard. (g) Quino.&Others, Easy. (h) Amin.&Tetra., Easy. (i) Sulfonamides, Easy. (j) Quino.&Others, Hard. (k) Amin.&Tetra., Hard. (l) Sulfonamides, Hard.

Figure 5.5: Barplots with sensitivity data, grouped based on antibiotic classes. Each bar is coloured based on the type of run that was performed: no pre-training (green), easy pre-training (blue) and hard pre-training (purple). The saturation of the colours indicates the complexity of the model: 1 encoder block (light), 3 encoder blocks (medium) and 12 encoder blocks (dark). Each subplot is labelled with the antibiotic class it presents and whether the fine-tuning was done with an easy or hard masking difficulty.

Observing Figure 5.5, it can be concluded that some of the tested antibiotics are easy for the models to correctly label as resistant. Such cases are LVX in panels g) and j), CPD in panels c) and f) and SXT in panels i) and l), where neither model complexity, pre-training nor fine-tuning masking percentage affected the ability to correctly predict resistance. Importantly, the high sensitivity did not limit the specificity for the antibiotics in question, as seen in Figure A.1 in Appendix 1.

However, certain antibiotics in the tested set show clear signs of how model complexity influences the ability to correctly predict resistance. Examples of this are found in Figures 5.5h and 5.5k, where all tested antibiotics show a significant increase in sensitivity for a minor trade-off in specificity, see Figures A.1h and A.1k in Appendix 1, for both masking percentages. This observation is most notable for TOB and GEN, for which the models supposedly have more difficulty predicting resistance compared to TET and STR. In addition, the effect of pre-training can also be seen in this antibiotic class, as pre-trained models match or outperform non-pre-trained models in sensitivity for a low trade-off in specificity. This trend is evident for the more complex models on the difficult antibiotics TOB and GEN.

Continuing, certain antibiotics in the study depend on model complexity for resistances to be found at all. The antibiotic AMC shows an almost 100% specificity for all models at both levels of masking percentage in Figures A.1a and A.1d in Appendix 1. In contrast, the sensitivity for AMC seen in Figures 5.5a and 5.5d is low compared to AMP and AZT in the same group. A dependency on model complexity can be seen for both levels of masking percentage for AMC, in particular for the hard masking percentage, where only models with 3 and 12 encoder blocks have a sensitivity above 0. This indicates that models built with 1 encoder block, regardless of the tested pre-training, are not complex enough to detect resistance to this antibiotic in a setting where less information is given. A similar conclusion can be drawn for the antibiotics in the carbapenems class.
Observing Figures 5.5b and 5.5e, only models with 12 encoder blocks are able to correctly find resistances to the antibiotics ETP, IMI and MEM, which are the antibiotics with the most imbalanced R/S percentages in the data set, see Figure 4.1. This answers an important question regarding the importance of model complexity in the field: only very complex models have the capacity to accurately predict resistance to antibiotics where the occurrence of resistance is rare. Moreover, the importance of pre-training is also displayed in the same group, as pre-training resulted in up to 3 times the sensitivity without a significant loss in specificity, see Figures A.1b and A.1e in Appendix 1.

Observing Figure 5.5, the use of pre-training has an overall positive effect on the sensitivity for a number of antibiotics, as seen in the cases presented above. However, drawing a generalized conclusion on which type of pre-training is better for the system is difficult. Observing CMP in Figures 5.5g and 5.5j, an increase in sensitivity can be seen for all model complexities at both masking percentages with pre-training, without a significant loss of specificity, see Figures A.1g and A.1j in Appendix 1. However, the type of pre-training which resulted in the highest sensitivity differs. For fine-tuning with an easy masking percentage, the models benefited from pre-training with a hard masking percentage, and the opposite was found when studying the models fine-tuned with a hard masking percentage. This pattern is not found when analyzing other antibiotics in the study. Examples of this are the antibiotics in the cephalosporins class in Figure 5.5c. For both CXT and CTR, the models with 3 encoder blocks benefited more from an easy masking percentage during pre-training, whereas the models with 12 encoder blocks benefited more from a harder masking percentage during pre-training. Even though the best type of pre-training was situational, a common denominator between all three examples presented is that both types of pre-training outperformed no pre-training at all, which underlines its importance.

5.4 Analyzing learning patterns

A PCA-CLS analysis was performed on all models built with 12 encoder blocks from Chapter 5.3 to understand important features of the data that influence the decisions made by the models. The number of PC-components to analyze was decided based on the complementary Scree-plots produced for each PCA-CLS analysis, see Figure A.2 in Appendix 1. Based on Figure A.2, the first 3 PC-components were selected to be studied further. The 100 data points with the highest and lowest values for each PC-component were selected as points of interest, denoted the Top- and Bot-group respectively. These points are of interest as they have significant disparity in the direction of the studied component. All chromosomal genes with point mutations are presented as the original amino acid, the position of the mutation, followed by the substituted amino acid.
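A minimal sketch of this analysis is given below, assuming scikit-learn and NumPy. The arrays cls_embeddings and isolate_args are randomly generated stand-ins rather than the thesis data, and the variable names are placeholders: the CLS-tokens are projected with PCA, the 100 isolates with the highest and lowest scores on a component form the Top- and Bot-groups, and the ARGs they carry are counted.

import numpy as np
from collections import Counter
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cls_embeddings = rng.normal(size=(5000, 512))         # stand-in for the real CLS-tokens
isolate_args = [["gyrA-S83L", "parC-S80I"]] * 5000    # stand-in ARG list per isolate

pca = PCA(n_components=3)              # first three components, as chosen from the Scree-plots
scores = pca.fit_transform(cls_embeddings)
# pca.explained_variance_ratio_ holds the values that a Scree-plot would display.

component = 0                          # 0 -> PC1, 1 -> PC2, 2 -> PC3
order = np.argsort(scores[:, component])
bot_idx, top_idx = order[:100], order[-100:]          # points of interest in each direction

top_counts = Counter(arg for i in top_idx for arg in isolate_args[i])
bot_counts = Counter(arg for i in bot_idx for arg in isolate_args[i])
print(top_counts.most_common(5))
print(bot_counts.most_common(5))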
5.4.1 PC1: ARG quantity

The ARG content for the points of interest for PC-component 1 (PC1) was analyzed by counting the number of occurrences of each ARG in the Top- and Bot-groups. With this, a list of the top 5 most common ARGs in each group was created, presented in Table 5.2.

Table 5.2: The five most common ARGs in the Top- and Bot-groups of points of interest for all tested models in PC1. The number of occurrences for each ARG is presented in parentheses after the ARG.

Top 1 2 3 4 5
12E-N glpT-E448K(92) gyrA-S83L(89) parC-S80I(85) gyrA-D87N(82) aph(3”)-Ib(72)
12E-E parC-S80I(97) gyrA-S83L(97) gyrA-D87N(92) glpT-E448K(88) aph(3”)-Ib(81)
12E-H gyrA-S83L(97) parC-S80I(96) gyrA-D87N(93) glpT-E448K(91) aph(3”)-Ib(85)
12H-N glpT-E448K(93) parC-S80I(90) gyrA-S83L(90) gyrA-D87N(88) sul1(71)
12H-E gyrA-S83L(97) parC-S80I(96) glpT-E448K(95) gyrA-D87N(95) sul1(87)
12H-H parC-S80I(99) gyrA-S83L(99) gyrA-D87N(96) glpT-E448K(91) sul1(85)

Bot 1 2 3 4 5
12E-N glpT-E448K(96) pmrB-Y358N(30) pmrB-E123D(9) cyaA-S352T(8) marR-S3N(8)
12E-E glpT-E448K(99) pmrB-Y358N(20) pmrB-E123D(18) marR-S3N(16) cyaA-S352T(12)
12E-H glpT-E448K(98) pmrB-Y358N(52) marR-S3N(26) pmrB-E123D(21) uhpT-E350Q(2)
12H-N glpT-E448K(91) pmrB-Y358N(23) marR-S3N(22) pmrB-E123D(15) cyaA-S352T(4)
12H-E glpT-E448K(99) pmrB-Y358N(27) pmrB-E123D(26) marR-S3N(11) cyaA-S352T(8)
12H-H glpT-E448K(99) pmrB-Y358N(22) marR-S3N(16) pmrB-E123D(12) cyaA-S352T(8)

Observing Table 5.2, the common genetic content differs between the groups. ARGs such as the point mutations in the gyrA gene and the point mutation parC-S80I are common occurrences in the Top-group for all models. However, the quantity of common ARGs in the Bot-group is lower. Out of the 100 points of interest, only one of the six models counted its 5th most common ARG 10 times or more, compared to the Top-group where all models counted the 5th most common ARG more than 70 times. This indicates that the overall ARG quantity in the Bot-group is generally lower. Moreover, the ARG that stands out for all models in the Bot-group, the point mutation glpT-E448K, can also be found as a common ARG in the Top-group, which means that the ARG itself is a common occurrence and might not be PC1 dependent. With this, the CLS-tokens of all sequences for each model were visualized as scatter plots with additional colouring dependent on the number of ARGs each isolate had, see Figure 5.6.

(a) Easy FT, No PT. (b) Easy FT, Easy PT. (c) Easy FT, Hard PT. (d) Hard FT, No PT. (e) Hard FT, Easy PT. (f) Hard FT, Hard PT.

Figure 5.6: Scatter plot of PC1 & PC2 for each CLS-token from the tested models. Each CLS-token in all subplots has been coloured depending on the number of ARGs in the respective sequence. Colour palettes are consistent between subfigures. Each subplot is captioned with the settings used in training to produce the model.

Observing Figure 5.6, a dependency on the quantity of ARGs along the PC1 axis can be seen for all tested models. For all models, a large group of data points with a lower ARG count is located on the negative side of the PC1 axis, with an arc shape towards the positive PC1 axis where points with higher ARG counts are located. As all Scree-plots in Figure A.2 indicate that PC1 explains a significant part of the disparity of the CLS-tokens, the models base a lot of their decisions on the number of present ARGs. This observation is logical in the sense that the number of ARGs in a bacterium is one of the more fundamental aspects of the data. Having more ARGs in a sequence gives the models more information to base the resistance predictions on, which makes the quantity of ARGs the most important factor for the models.

5.4.2 PC2: ARG groupings

Similar to PC1, the occurrences of ARGs were counted for the points of interest for PC-component 2 (PC2). The list of the top 5 most common ARGs in the Top- and Bot-groups can be found in Table 5.3.
Table 5.3: The five most common ARGs in the Top- and Bot-groups of points of interest for all tested models in PC2. The number of occurrences for each ARG is presented in parentheses after the ARG.

Top 1 2 3 4 5
12E-N gyrA-S83L(97) parC-S80I(96) gyrA-D87N(92) glpT-E448K(85) uhpT-E350Q(77)
12E-E gyrA-S83L(100) gyrA-D87N(95) parC-S80I(95) glpT-E448K(80) uhpT-E350Q(71)
12E-H gyrA-S83L(99) gyrA-D87N(98) parC-S80I(98) glpT-E448K(85) uhpT-E350Q(72)
12H-N gyrA-D87N(100) gyrA-S83L(100) parC-S80I(99) glpT-E448K(92) uhpT-E350Q(87)
12H-E gyrA-S83L(100) parC-S80I(99) gyrA-D87N(96) uhpT-E350Q(77) pmrB-E123D(71)
12H-H gyrA-S83L(94) glpT-E448K(86) parC-S80I(85) gyrA-D87N(81) uhpT-E350Q(75)

Bot 1 2 3 4 5
12E-N glpT-E448K(95) aph(3”)-Ib(69) aph(6)-Id(67) tet(B)(63) sul2(61)
12E-E glpT-E448K(94) aph(3”)-Ib(74) aph(6)-Id(70) sul2(67) tet(B)(62)
12E-H blaTEM-1(90) glpT-E448K(89) aph(3”)-Ib(73) aph(6)-Id(70) tet(B)(63)
12H-N aph(3”)-Ib(92) aph(6)-Id(90) glpT-E448K(89) blaTEM-1(73) tet(B)(71)
12H-E aph(3”)-Ib(87) blaTEM-1(85) aph(6)-Id(84) glpT-E448K(83) sul2(71)
12H-H glpT-E448K(93) blaTEM-1(92) aph(3”)-Ib(69) aph(6)-Id(67) tet(A)(62)

Observing Table 5.3, a quantity difference in ARGs between the Top- and Bot-groups is not as apparent as in the case of PC1. This indicates that PC2 is not as dependent on the number of ARGs in a sequence but rather on the type of ARGs present. From Table 5.3 a trend can be observed in which the two point mutations of gyrA and the point mutation parC-S80I are common occurrences in the Top-group, with almost every isolate carrying all three ARGs. Moreover, a similar pattern can be observed in the Bot-group, where the two aph ARGs are prevalent in most of the points of interest. This dependency was studied by plotting the CLS-tokens in PC1 and PC2 with additional colouring dependent on which ARGs each data point carried. The isolates were given separate colours depending on whether they had both the gyrA-D87N and parC-S80I mutations, both aph(3”)-Ib and aph(6)-Id, or none of these ARGs. In addition, a fourth colour was introduced for data points which carried all four of the previously mentioned ARGs. The scatter plots of PC1 & PC2 with the introduced colour system for all studied models can be found in Figure 5.7.

(a) Easy FT, No PT. (b) Easy FT, Easy PT. (c) Easy FT, Hard PT. (d) Hard FT, No PT. (e) Hard FT, Easy PT. (f) Hard FT, Hard PT.

Figure 5.7: Scatter plot of PC1 & PC2 for each CLS-token from the tested models. Each CLS-token in all subplots has been coloured depending on the ARG content of the sequence. Colour palettes are consistent between subfigures. Isolates with none of the studied ARGs (purple), isolates with gyrA-D87N and parC-S80I (blue), isolates with aph(3”)-Ib and aph(6’)-Id (green) and isolates with all of the studied ARGs (yellow) are marked accordingly. Each subplot is captioned with the settings used in training to produce the model.

Observing Figure 5.7, a dependency on the tested ARG groups along the PC2 axis can be seen. All models separate the isolates containing the gyrA-D87N and parC-S80I mutations from the isolates containing aph(3”)-Ib and aph(6)-Id. Any overlapping regions between the studied ARG groups contain either both of the groups or none of them. Point mutations in the gyrA and parC genes in E. coli encode resistance to antibiotics in the quinolones class.
In order to gain high-level resistance against the fluorine-containing quinolones, a bacterium needs more than one mutation in the genes involved in the maintenance of DNA topology. Two of these genes are gyrA and parC, which explains why these ARGs are found together in the data [37]. On the other hand, the aph genes encode resistance to antibiotics in the aminoglycosides class [38]. Given the significant segregation of the data that the models have performed, a difference in the classes that the studied ARGs encode resistance to is expected. What is not yet understood is why the models weigh resistance profiles to these classes so heavily compared to the other available classes, such as penicillins. Moreover, other ARGs which also encode resistance against the antibiotic classes highlighted in PC2 are not prioritized by the models. The qnr ARGs also encode resistance to quinolones but cannot be found in the Top-group where the other quinolone ARGs are present [39].

5.4.3 PC3: pmrB mutations

Finally, the ARG content for the points of interest for PC-component 3 (PC3) was analyzed following the same methodology as previously described. A list of the top 5 most common ARGs in each group for the third PC axis can be found in Table 5.4.

Table 5.4: The five most common ARGs in the Top- and Bot-groups of points of interest for all tested models in PC3. The number of occurrences for each ARG is presented in parentheses after the ARG.

Top 1 2 3 4 5
12E-N glpT-E448K(98) pmrB-Y358N(65) blaCMY-2(54) tet(B)(27) sul2(25)
12E-E glpT-E448K(94) blaCMY-2(54) pmrB-Y358N(47) tet(A)(42) cyaA-S352T(27)
12E-H glpT-E448K(96) blaCMY-2(81) pmrB-Y358N(44) cyaA-S352T(34) uhpT-E350Q(15)
12H-N glpT-E448K(97) blaCMY-2(54) pmrB-Y358N(45) cyaA-S352T(37) qnrS1(17)
12H-E glpT-E448K(67) tet(B)(40) qnrS1(38) sul1(37) aadA2(34)
12H-H glpT-E448K(90) tet(B)(80) pmrB-Y358N(62) aph(6)-Id(43) aph(3”)-Ib(43)

Bot 1 2 3 4 5
12E-N gyrA-S83L(86) parC-S80I(75) mph(A)(71) sul2(71) glpT-E448K(69)
12E-E pmrB-E123D(85) gyrA-S83L(84) uhpT-E350Q(83) blaTEM-1(74) dfrA17(71)
12E-H gyrA-S83L(100) gyrA-D87N(99) parC-S80I(99) uhpT-E350Q(88) pmrB-E123D(86)
12H-N pmrB-E123D(98) uhpT-E350Q(96) gyrA-S83L(95) gyrA-D87N(91) parC-S80I(90)
12H-E pmrB-E123D(83) blaTEM-1(78) glpT-E448K(76) uhpT-E350Q(71) gyrA-S83L(65)
12H-H pmrB-E123D(82) uhpT-E350Q(76) marR-S3N(62) blaTEM-1(58) gyrA-S83L(55)

Observing Table 5.4, a dependency similar to the one found in PC2 between two groups of ARGs is not as apparent for PC3. The diversity of ARGs among the points of interest is higher than in PC1 and PC2, with the most consistent ARG being glpT-E448K in the Top-group, similar to PC1. However, as discussed previously, the frequent nature of this ARG complicates any further analysis of it being an important factor for PC3. As can be seen in Table 5.4, glpT-E448K can also be found in the Bot-group with lower counts. The ARG grouping of the gyrA / parC point mutations seen in PC2 is common in PC3 as well. However, how common this ARG group is varies more between models than in PC2. Moreover, the segregation between gyrA / parC and the aph ARGs seen in PC2 is not present in PC3. This indicates that the gyrA / parC mutations might be influential to PC3, but not as important as they were found to be in PC2. An ARG that may be more influential to PC3, however, is the point mutation pmrB-E123D, which is prevalent in the Bot-group, see Table 5.4.
This ARG is common for five out of six tested models, and it is also the most common ARG for all models fine-tuned with a 60% masking percentage. Interestingly, another point mutation of pmrB is also common for five out of six models in the Top-group. The relationship between the point mutations pmrB-E123D and pmrB-Y358N was studied by plotting the CLS-tokens in PC1 and PC3 with additional colouring dependent on the ARGs in each isolate. One colour was given to all isolates carrying pmrB-E123D and one colour was given to all isolates carrying pmrB-Y358N. In addition, separate colours were also given to the isolates carrying both ARGs or none of them. The resulting scatter plots can be found in Figure 5.8.

(a) Easy FT, No PT. (b) Easy FT, Easy PT. (c) Easy FT, Hard PT. (d) Hard FT, No PT. (e) Hard FT, Easy PT. (f) Hard FT, Hard PT.

Figure 5.8: Scatter plot of PC1 & PC3 for each CLS-token from the tested models. Each CLS-token in all subplots has been coloured depending on the ARG content of the sequence. Isolates with none of the studied point mutations (beige), isolates with pmrB-Y358N (aqua), isolates with pmrB-E123D (navy) and isolates with both studied point mutations (black) are marked accordingly. Colour palettes are consistent between subfigures. Each subplot is captioned with the settings used in training to produce the model.

Observing Figure 5.8, a dependency between the two point mutations of pmrB can be seen in PC3 for all tested models. Data points carrying pmrB-Y358N are mostly found on the positive PC3 axis, whereas data points with pmrB-E123D are located more towards the negative side. This trend in PC3 suggests that the pmrB mutations are an important basis for the decisions the models make. The segregation of the data dependent on pmrB is not ideal, as regions with overlap exist, and unlike the case of the ARG groupings in PC2, these overlapping regions do not contain a combination of the studied ARGs. In fact, Figure 5.8 shows that no isolates in the data set carry both of the studied pmrB mutations. The hypothesis regarding this observation is that the pmrB gene in E. coli does not benefit from the amino acid changes of both mutations simultaneously, whereas having one of them can be advantageous. Similar observations have been made when studying how different numbers of amino acid changes affect the structure of proteins [40].

The functionality of the pmrB gene is connected to reducing negative charges in the outer monolayer of the bacterial membrane. Mutations in the gene have been studied in relation to resistance to colistin, an antibiotic in the polymyxins class which is often regarded as a last-resort option due to the possible side effects during treatment [41]. Importantly, as can be seen in Figure 4.1, due to the rare usage of colistin the antibiotic did not meet the required thresholds to be included in the thesis. This means that the models weigh these pmrB point mutations for reasons unrelated to colistin resistance. The reasons why this phenomenon appears can only be speculated on, as research regarding pmrB mutations is centered around resistance to polymyxins, which cannot be the explanation here since no antibiotics from this class are included in the thesis. This leads to the idea that point mutations in the pmrB gene may play a larger role in antibiotic resistance than previously anticipated, by potentially facilitating other ARGs.
Relationships between pmrB point mutations and other ARGs could possibly be found in the observed overlapping regions in the PC3 plots in Figure 5.8, as the overlapping regions themselves indicate that PC3 is most likely dependent on something in addition to the pmrB mutations.

5.5 Future work

Further experimentation with how to better tailor the models to the problem and the data at hand was not conducted due to time constraints. Studying the potential impact of other types of loss functions, of varying the number of attention heads in the encoder blocks, and of similar implementation choices on even rarer antibiotics which did not meet the requirements set in the thesis could be beneficial. In general, the goal should be to introduce more antibiotics to the models, as from a clinical perspective this would give a more well-rounded treatment plan for patients. The thesis saw an increase in sensitivity for a number of antibiotics with rare occurrences of resistance when model complexity increased. Pushing the model complexity even further with more encoder blocks could see resistance to these rare antibiotics being detected even better, up to a possible threshold. Furthermore, the positive impact of pre-training was observed in the thesis. As discussed in Chapter 5.1, the models had the potential to be pre-trained even longer, as the training was forcefully stopped after 100 epochs. Letting the pre-trained models train until the early stopping criteria are met may be resource demanding but could improve results and give a clearer answer to which type of pre-training is more beneficial to the system. Lastly, the thesis only studied data from the species E. coli. Branching out and studying genomic data from more than one species opens up the possibility for the model to accurately predict the species encountered in a clinical setting as well as the antibiotics the isolates are resistant to.

The thesis highlighted some of the ARGs that the models deemed more important when predicting resistance profiles. The reason why the models weight these ARGs so highly compared to other ARGs in the studied data set is yet to be known. Understanding why the relationship between gyrA / parC and the aph ARGs is so important, or what the point mutations in pmrB provide for resistance profiles not related to polymyxins, may be instrumental to gaining a deeper understanding of antibiotic resistance. However, this knowledge of what the models are trying to teach us is something that most likely needs to be studied further in a laboratory setting.

6 Conclusion

The thesis aimed to implement a transformer-based AI method that utilizes genomic data to predict resistance to commonly used antibiotics. The models created using this methodology managed to correctly predict the resistance profiles of bacteria in upwards of 94% of cases, depending on the selection of hyperparameters and model complexity. As the R/S percentages for all antibiotics in the data set used for the thesis were unbalanced even after introducing exclusion criteria, further analysis with different metrics highlighted the importance of a high model complexity. In particular, resistance to certain classes of antibiotics, such as the carbapenems, where resistance is a rare occurrence in the data set, was only captured by the most complex models tested. The thesis concluded that the use of 12 encoder blocks is beneficial for the model and encourages future studies regarding increasing the model complexity even further.
Moreover, the thesis can also conclude that the use of self-supervised pre-training on genetic data improves the ability of a complex model to accurately find resistance profiles in bacteria. Even though models with both types of pre-training consistently performed better than models without, the type of pre-training that performed better was situational, which means that no conclusion was drawn on which type should be preferred.

Studying the disparity of the CLS-tokens produced by the trained models revealed important aspects of what the decisions are based on. To begin with, the most important characteristic of the data to the models is how many ARGs are present in a sequence. This fundamental aspect of the data influences most of the decisions made by the models. Moreover, the lower dimensions of the PCA highlighted how certain ARGs are more influential than others in the decisions the models make. One such finding in the thesis was a relationship between two groups of ARGs: gyrA / parC, encoding resistance to fluoroquinolones, and aph(3”)-Ib / aph(6)-Id, encoding resistance to aminoglycosides. Furthermore, the point mutations E123D and Y358N in the gene pmrB also showed significant influence on the decisions made by the models. These genetic findings, which are based on patterns the models have found, are of interest for understanding antibiotic resistance better and are encouraged to be studied further in a laboratory setting.

Bibliography

[1] S. Dattani, L. Rodés-Guirao, H. Ritchie, E. Ortiz-Ospina, M. Roser. Life Expectancy. Our World in Data. 2023. Available from: https://ourworldindata.org/life-expectancy
[2] J. Inda-Díaz. New AI-based methods for studying antibiotic-resistant bacteria. Gothenburg: University of Gothenburg, Faculty of Science; 2023. Available from: https://hdl.handle.net/2077/78675
[3] W.A. Adedeji. The treasure called antibiotics. Ann Ib Postgrad Med. 2016 Dec;14(2):56-57. PMID: 28337088; PMCID: PMC5354621.
[4] What is the total consumption of antibiotics? [Internet]. Federal Office of Public Health (FOPH). [updated 06.12.2023]. Available from: https://www.bag.admin.ch/bag/en/home/krankheiten/infektionskrankheiten-bekaempfen/antibiotikaresistenzen/wie-viele-antibiotika-verbrauchen-wir—.html
[5] M. Lobanovska, G. Pilla. Penicillin’s Discovery and Antibiotic Resistance: Lessons for the Future? Yale Journal of Biology and Medicine. 2017 Mar 29;90(1):135-145. PMID: 28356901; PMCID: PMC5369031.
[6] Antimicrobial resistance [Internet]. World Health Organization (WHO). [updated 21.11.2023]. Available from: https://www.who.int/news-room/fact-sheets/detail/antimicrobial-resistance
[7] M. Naghavi, et al. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet. 2022 Feb 12–18;397(10274):533-43.
[8] World Bank. Drug-Resistant Infections: A Threat to Our Economic Future. Washington, DC: World Bank; 2017. License: Creative Commons Attribution CC BY 3.0 IGO.
[9] G. Cox, G.D. Wright. Intrinsic antibiotic resistance: Mechanisms, origins, challenges and solutions. Int J Med Microbiol. 2013;303(6-7):287-292. doi: 10.1016/j.ijmm.2013.02.009.
[10] E. Peterson, P. Kaur. Antibiotic resistance mechanisms in bacteria: relationships between resistance determinants of antibiotic producers, environmental bacteria, and clinical pathogens. Frontiers in Microbiology. 2018 Nov 30;9:2928. doi: 10.3389/fmicb.2018.02928.
[11] A. MacGowan, E. Macnaughton. Antibiotic resistance. Medicine. 2017;45(10):622-628.
doi: 10.1016/j.mpmed.2017.07.006.
[12] M. Frieri, K. Kumar, A. Boutin. Antibiotic resistance. J Infect Public Health. 2017;10(4):369-378. doi: 10.1016/j.jiph.2016.08.007.
[13] I. Gajic, J. Kabic, D. Kekic, M. Jovićević, M. Milenkovic, D. Mitić-Ćulafić, A. Trudic, L. Ranin, N. Opavski. Antimicrobial susceptibility testing: a comprehensive review of currently used methods. Antibiotics. 2022 Mar;11:427. doi: 10.3390/antibiotics11040427.
[14] M.F. Anjum, E. Zankari, H. Hasman. Molecular methods for detection of antimicrobial resistance. Microbiol Spectrum. 2017;5(6). doi: 10.1128/microbiolspec.arba-0011-2017.
[15] C. Köser. Whole-genome sequencing to control antimicrobial resistance. Trends Genet. 2014;30(9).
[16] T. Davenport, R. Kalakota. The potential for artificial intelligence in healthcare. Future Healthcare Journal. 2019 Jun;6(2):94-98. doi: 10.7861/futurehosp.6-2-94.
[17] R. Forghani, ed. Machine Learning and Other Artificial Intelligence Applications, An Issue of Neuroimaging Clinics of North America, E-Book. Elsevier Health Sciences; 2020. ISBN: 9780323712453.
[18] V. Kaul, S. Enslin, S.A. Gross. History of artificial intelligence in medicine. Gastrointestinal Endoscopy. 2020;92(4):807-812. doi: 10.1016/j.gie.2020.06.040.
[19] I. Goodfellow, Y. Bengio, A. Courville. Deep Learning. MIT Press; 2016.
[20] A. Gillioz, J. Casas, E. Mugellini, O. Abou Khaled. Overview of the Transformer-based Models for NLP Tasks. In: Proceedings of the Federated Conference on Computer Science and Information Systems. 2020:179-183. doi: 10.15439/2020F20.
[21] A. Vaswani, et al. Attention Is All You Need. In: 31st Conference on Neural Information Processing Systems (NIPS 2017); Long Beach, CA, USA.
[22] J.K. Tripathy, et al. Comprehensive analysis of embeddings and pre-training in NLP. Computer Science Review. 2021;42:100433. doi: 10.1016/j.cosrev.2021.100433.
[23] O. Ozyegen, H. Jahanshahi, M. Cevik, et al. Classifying multi-level product categories using dynamic masking and transformer models. J Data Inf Manag. 2022;4:71-85. doi: 10.1007/s42488-022-00066-6.
[24] R. Irwin, S. Dimitriadis, J. He, E.J. Bjerrum. Chemformer: a pre-trained transformer for computational chemistry. Machine Learning: Science and Technology. 2022;3(1):015022. doi: 10.1088/2632-2153/ac3ffb.
[25] S. Jamil, M. Jalil Piran, O.J. Kwon. A Comprehensive Survey of Transformers for Computer Vision. Drones. 2023;7(5):287. doi: 10.3390/drones7050287.
[26] K.M. Ruff, R.V. Pappu. AlphaFold and implications for intrinsically disordered proteins. Journal of Molecular Biology. 2021;433(20):167208. doi: 10.1016/j.jmb.2021.167208.
[27] C. Zhang, S. Liwicki, R. Cipolla. Image reranking using pretrained vision transformers. Cambridge Research Lab. 2022; University of Cambridge, UK.
[28] P. Shaw, J. Uszkoreit, A. Vaswani. Self-attention with relative position representations. In: North American Chapter of the Association for Computational Linguistics; 2018. Available from: https://api.semanticscholar.org/CorpusID:3725815
[29] L. Liu, J. Liu, J. Han. Multi-head or Single-head? An Empirical Comparison for Transformer Training. arXiv:2106.09650 [cs.CL]. 2021.
[30] J. Devlin, M.W. Chang, K. Lee, K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2019. p. 4171–4186. doi: 10.18653/v1/N19-1423.
[31] J.L. Ba, J.R.
Kiros, G.E. Hinton. Layer Normalization. arXiv:1607.06450 [stat.ML]. 2016.
[32] X. Ying. An Overview of Overfitting and its Solutions. Journal of Physics. 2019;1168:022022. doi: 10.1088/1742-6596/1168/2/022022.
[33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research. 2014;15:1929-1958.
[34] I.T. Jolliffe, J. Cadima. Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A. 2016;374:20150202. doi: 10.1098/rsta.2015.0202.
[35] F.L. Gewers, G.R. Ferreira, H.F. De Arruda, F.N. Silva, C.H. Comin, D.R. Amancio, D.R. Costa. Principal Component Analysis: A Natural Approach to Data Exploration. ACM Comput. Surv. 2021;54(4):70:1-34. doi: 10.1145/3447755.
[36] T.F. Monaghan, S.N. Rahman, C.W. Agudelo, A.J. Wein, J.M. Lazar, K. Everaert, R.R. Dmochowski. Foundational Statistical Principles in Medical Research: Sensitivity, Specificity, Positive Predictive Value, and Negative Predictive Value. Medicina (Kaunas). 2021 May;57(5):503. doi: 10.3390/medicina57050503. PMCID: PMC8156826. PMID: 34065637.
[37] A. Johnning, E. Kristiansson, J. Fick, B. Weijdegård, D.J.G. Larsson. Resistance Mutations in gyrA and parC are Common in Escherichia Communities of both Fluoroquinolone-Polluted and Uncontaminated Aquatic Environments. Front Microbiol. 2015;6. doi: 10.3389/fmicb.2015.01355. Available from: https://www.frontiersin.org/journals/microbiology/articles/10.3389/fmicb.2015.01355
[38] M. Woegerbauer, J. Zeinzinger, B. Springer, P. Hufnagl, A. Indra, I. Korschineck, et al. Prevalence of the aminoglycoside phosphotransferase genes aph(3)-IIIa and aph(3)-IIa in Escherichia coli, Enterococcus faecalis, Enterococcus faecium, Pseudomonas aeruginosa, Salmonella enterica subsp. enterica and Staphylococcus aureus isolates in Austria. J Med Microbiol. 2014;63(2).
[39] M. Rezazadeh, H. Baghchesaraei, A. Peymani. Plasmid-Mediated Quinolone-Resistance (qnr) Genes in Clinical Isolates of Escherichia coli Collected from Several Hospitals of Qazvin and Zanjan Provinces, Iran. Osong Public Health Res Perspect. 2016;7(5):307-312. doi: 10.1016/j.phrp.2016.08.003. Available from: https://www.sciencedirect.com/science/article/pii/S2210909916300984
[40] R. Dorantes-Gilardi, L. Bourgeat, L. Pacini, L. Vuillon, C. Lesieur. In proteins, the structural responses of a position to mutation rely on the Goldilocks principle: not too many links, not too few. Phys Chem Chem Phys. 2018;20(39):25399-25410. doi: 10.1039/C8CP04530E. Available from: http://dx.doi.org/10.1039/C8CP04530E
[41] L.M. Lim, N. Ly, D. Anderson, J.C. Yang, L. Macander, A. Jarkowski, A. Forrest, J.B. Bulitta, B.T. Tsuji. Resurgence of Colistin: A Review of Resistance, Toxicity, Pharmacodynamics, and Dosing. Pharmacotherapy. 2010;30(12):1279-1291. doi: 10.1592/phco.30.12.1279.

A Appendix 1

Table A.1: Recorded sensitivity for each model and antibiotic in the study for easy fine-tuning (E). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used. The type of pre-training was either easy (E), hard (H) or no pre-training (N).
Ab name 1E-N 1E-E 1E-H 3E-N 3E-E 3E-H 12E-N 12E-E 12E-H
AMC 0.025 0.025 0.050 0.238 0.325 0.388 0.438 0.438 0.488
AMP 0.832 0.813 0.857 0.833 0.875 0.891 0.857 0.873 0.895
AZT 0.696 0.625 0.696 0.750 0.750 0.786 0.714 0.732 0.821
CZL 0.764 0.734 0.779 0.769 0.814 0.794 0.744 0.799 0.829
CPM 0.179 0.205 0.291 0.376 0.504 0.504 0.538 0.590 0.709
CTX 1.000 1.000 1.000 0.994 0.987 0.981 0.974 0.987 0.968
CXT 0.518 0.430 0.439 0.623 0.675 0.535 0.667 0.667 0.702
CPD 0.963 0.963 1.000 1.000 1.000 1.000 1.000 1.000 1.000
CTZ 0.682 0.611 0.753 0.702 0.778 0.783 0.712 0.758 0.798
CTR 0.741 0.737 0.777 0.785 0.839 0.834 0.741 0.805 0.878
CMP 0.041 0.245 0.265 0.429 0.551 0.673 0.592 0.694 0.796
CIP 0.772 0.873 0.826 0.868 0.886 0.873 0.886 0.886 0.921
ETP 0.000 0.000 0.000 0.000 0.000 0.000 0.154 0.154 0.154
GEN 0.362 0.329 0.396 0.577 0.483 0.530 0.584 0.758 0.725
IMI 0.000 0.000 0.000 0.000 0.000 0.000 0.149 0.286 0.286
LVX 0.889 0.959 0.930 0.965 0.982 0.965 0.965 0.959 0.977
MEM 0.000 0.000 0.000 0.000 0.000 0.000 0.149 0.429 0.429
STR 0.783 0.807 0.843 0.873 0.922 0.886 0.861 0.896 0.896
SXT 1.000 0.976 1.000 1.000 1.000 0.976 1.000 1.000 0.976
INN 0.836 0.796 0.829 0.855 0.875 0.862 0.875 0.882 0.875
TET 0.848 0.861 0.842 0.879 0.854 0.844 0.861 0.840 0.855
TOB 0.096 0.220 0.341 0.659 0.585 0.634 0.537 0.683 0.756
TRI 0.781 0.625 0.906 0.936 0.968 0.938 0.938 0.781 1.000
TMP 0.699 0.668 0.695 0.767 0.803 0.756 0.789 0.892 0.874

Table A.2: Recorded sensitivity for each model and antibiotic in the study for hard fine-tuning (H). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used. The type of pre-training was either easy (E), hard (H) or no pre-training (N).

Ab name 1H-N 1H-E 1H-H 3H-N 3H-E 3H-H 12H-N 12H-E 12H-H
AMC 0.000 0.000 0.000 0.125 0.200 0.088 0.338 0.288 0.338
AMP 0.781 0.783 0.795 0.770 0.857 0.821 0.804 0.799 0.826
AZT 0.589 0.536 0.571 0.714 0.679 0.696 0.643 0.643 0.750
CZL 0.709 0.739 0.729 0.734 0.789 0.784 0.729 0.754 0.769
CPM 0.111 0.051 0.137 0.256 0.282 0.325 0.282 0.402 0.564
CTX 1.000 1.000 1.000 1.000 1.000 1.000 0.994 0.968 0.974
CXT 0.281 0.175 0.298 0.395 0.500 0.447 0.518 0.412 0.553
CPD 1.000 1.000 0.963 0.963 1.000 1.000 0.926 1.000 0.926
CTZ 0.551 0.510 0.561 0.586 0.657 0.652 0.652 0.591 0.763
CTR 0.673 0.693 0.712 0.732 0.766 0.790 0.698 0.766 0.824
CMP 0.041 0.061 0.184 0.286 0.510 0.490 0.429 0.653 0.531
CIP 0.798 0.759 0.768 0.798 0.890 0.860 0.785 0.829 0.864
ETP 0.000 0.000 0.000 0.000 0.000 0.000 0.076 0.231 0.154
GEN 0.161 0.154 0.221 0.342 0.396 0.362 0.409 0.490 0.456
IMI 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.143 0.286
LVX 0.901 0.912 0.906 0.912 0.959 0.959 0.912 0.918 0.942
MEM 0.000 0.000 0.000 0.000 0.000 0.000 0.143 0.143 0.286
STR 0.729 0.741 0.771 0.837 0.873 0.867 0.849 0.867 0.855
SXT 1.000 0.952 0.976 0.952 1.000 1.000 1.000 1.000 0.976
INN 0.684 0.651 0.697 0.730 0.842 0.822 0.816 0.829 0.835
TET 0.775 0.754 0.770 0.826 0.802 0.824 0.748 0.799 0.840
TOB 0.121 0.073 0.073 0.341 0.390 0.439 0.341 0.512 0.707
TRI 0.875 0.906 0.936 0.938 0.938 0.938 0.906 0.906 1.000
TMP 0.596 0.601 0.601 0.704 0.758 0.758 0.704 0.767 0.789

Table A.3: Recorded specificity for each model and antibiotic in the study for easy fine-tuning (E). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and pre-training used.
The type of pre-training was either easy (E), hard (H) or no pre-training (N).

Ab name 1E-N 1E-E 1E-H 3E-N 3E-E 3E-H 12E-N 12E-E 12E-H
AMC 0.998 1.000 1.000 0.998 0.991 0.994 0.995 0.997 0.991
AMP 0.866 0.863 0.804 0.908 0.834 0.850 0.912 0.920 0.896
AZT 0.911 0.936 0.950 0.928 0.933 0.958 0.963 0.973 0.953
CZL 0.807 0.832 0.819 0.863 0.826 0.832 0.863 0.891 0.844
CPM 0.970 0.966 0.931 0.937 0.916 0.912 0.927 0.929 0.909
CTX 0.000 0.000 0.000 0.000 0.062 0.062 0.281 0.438 0.344
CXT 0.948 0.935 0.935 0.925 0.935 0.934 0.971 0.950 0.951
CPD 0.000 0.417 0.250 0.667 0.667 0.583 0.750 1.000 0.750
CTZ 0.848 0.850 0.838 0.894 0.896 0.888 0.909 0.929 0.909
CTR 0.903 0.914 0.906 0.928 0.931 0.932 0.953 0.964 0.928
CMP 0.998 0.983 0.987 0.985 0.983 0.985 0.987 0.990 0.980
CIP 0.994 0.977 0.975 0.978 0.979 0.986 0.983 0.986 0.973
ETP 1.000 1.000 1.000 1.000 1.000 1.000 0.997 0.998 0.998
GEN 0.982 0.967 0.962 0.968 0.980 0.977 0.988 0.981 0.985
IMI 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.998 1.000
LVX 0.979 0.933 0.933 0.954 0.964 0.959 0.979 0.979 0.959
MEM 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 0.999
STR 0.926 0.924 0.914 0.907 0.951 0.963 0.934 0.951 0.961
SXT 0.613 0.806 0.742 0.645 0.871 0.903 0.871 0.903 0.806
INN 0.908 0.908 0.905 0.915 0.931 0.938 0.945 0.969 0.955
TET 0.858 0.890 0.804 0.852 0.917 0.909 0.826 0.935 0.934
TOB 1.000 0.998 0.996 0.977 0.990 0.992 0.988 0.985 0.985
TRI 0.707 0.780 0.707 0.707 0.829 0.780 0.805 0.854 0.854
TMP 0.927 0.950 0.943 0.945 0.959 0.961 0.969 0.977 0.963

Table A.4: Recorded specificity for each model and antibiotic in the study for hard fine-tuning (H). The model names are listed as the number of encoder blocks followed by the masking percentage for the type of fine-tuning and