BL(B)ERT: Predicting Antibiotic Resistance with a Language AI Model
Applying a BERT Language Model to Predict Antibiotic Resistance Within Beta-lactamase/transpeptidase-like Proteins

Master's thesis in Complex Adaptive Systems

MATHIAS ÖRTENBERG TOFTÅS

Department of Mathematical Sciences
Chalmers University of Technology
Gothenburg, Sweden 2022
www.chalmers.se

© MATHIAS ÖRTENBERG TOFTÅS, 2022.

Supervisor: Erik Kristiansson, Chalmers, Mathematical Sciences
Examiner: Erik Kristiansson, Mathematical Sciences

Master's Thesis 2022
Department of Mathematical Sciences
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Figure illustrating the classification of a protein with an AI model. The protein in the figure is used under a Creative Commons Attribution-Share Alike license; the figure has been modified, and the original can be found in [1].

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2022

Abstract

The effectiveness of using a BERT language AI model to predict antibiotic resistance within beta-lactamase/transpeptidase-like proteins is tested. The performance of the model is compared to a traditional Hidden Markov Model (HMM) to verify its capabilities. To further assess the model's capabilities, several tests, such as cross-validation and robustness towards sequence read errors, are performed. We find that the BERT model outperforms the HMM model on amino acid sequences with lengths of around 40-50 and shorter. These are the sequence lengths which we expect to encounter in our use case. For longer sequences, the HMM model is preferable as it requires less computation time. We also find that the BERT model and the HMM model have learned different aspects of the data, allowing the results of both methods to be combined to achieve even better results.

Keywords: Antibiotic Resistance, BERT, Beta-lactamase/transpeptidase-like, Hidden Markov Model.

Acknowledgements

I would like to thank my supervisor Erik Kristiansson and his colleagues Fanny Berglund and Jaun Salvador Inda Diaz for their continuous support and help during this spring. Without their subject area expertise and insights into what would be interesting to delve deeper into, this project would likely not have gone further than the training of the model. I would also like to thank David Lund for his work with procuring the data used in this work.
Mathias Örtenberg Toftås, Gothenburg, June 2022 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: BERT Bidirectional Encoder Representations from Transformers HMM Hidden Markov Model FFN Feed Forward Neural Network ROC Receiver Operating Characteristic CLS Class Token MSK Mask Token PAD Padding Token CPU Central Processing Unit GPU Graphical Processing Unit DNA Deoxyribonucleic Acid mRNA Messenger Ribonucleic Acid tRNA Transfer Ribonucleic Acid ix Nomenclature Below is the nomenclature of indices, sets, parameters, and variables that have been used throughout this thesis. Parameters dmodel Hidden Dimension Size lmax Maximum Sequence Length h Number of Attention Heads dself−attention Self-Attention Internal Size dF F N FFN Internal Size ddepth Number of Stacked Encoders rdrop Dropout Rate xi xii Contents List of Acronyms ix Nomenclature xi List of Figures xv List of Tables xix 1 Introduction 1 2 Theory 3 2.1 Proteins Synthesis and DNA Sequencing . . . . . . . . . . . . . . . . 3 2.2 Protein Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.3 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.1 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.3.2 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 Methods 11 3.1 The Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.2 The BERT Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 4 Results 19 4.1 Comparing with Previous Methods . . . . . . . . . . . . . . . . . . . 19 4.2 Testing Training Transferability . . . . . . . . . . . . . . . . . . . . . 25 4.3 Cross-Validating the Model . . . . . . . . . . . . . . . . . . . . . . . 26 4.4 Timing the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.5 Glimpsing into the Model . . . . . . . . . . . . . . . . . . . . . . . . 27 4.6 Testing on Metegenomic Reads . . . . . . . . . . . . . . . . . . . . . 30 4.7 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . . . . 32 Bibliography 35 xiii Contents xiv List of Figures 2.1 Graphic representing protein synthesis. . . . . . . . . . . . . . . . . . 3 2.2 Translation of codons into amino acids. To arrive at the correspond- ing amino acid, begin at the center of the figure with the first nucleic acid of the codon, then follow the graphic outwards. . . . . . . . . . . 4 2.3 Figure demonstrating a proteins structure and its active site. The figure text font has been altered from the original, and a link to the original can be found here [8]. . . . . . . . . . . . . . . . . . . . . . . 5 2.4 Figure demonstrating how different amino acid sequences can be folded to have similar shapes. In the figure, the blue blocks rep- resent the amino acids part of the active site. The ’folding’ in the figure consists of taking segments of 3 amino acids and rotating them freely to achieve the same result. . . . . . . . . . . . . . . . . . . . . 6 2.5 A single Encoder block. The Encoder consists of two sub units, the Multi-Head Attention and the Feed Forward network. Each sub units output is subjected to dropout (not shown) added to the original input and then layer normalized before being feed forward. . . . . . . . . . 8 2.6 The components within the Encoder. Not shown in the figure is the dropout that occurs on the output of the softmax layer before the final matrix multiplication in the Scaled Dot-Product Attention. . . . 
9 3.1 Visual representation of the various data sets used in this work. a represents all bacterial DNA, b presents our specific family of bacte- ria, c represents DNA within this family which has been labelled by using a HMM model, and d represents our set of true labelled DNA. . 11 3.2 Visual representation grouping of data to perform the cross-validation. The dark orange blocks represent the original validation data, the yel- low blocks represent the original training data. The pluses and mi- nuses represent the sets of antibiotic resistant positive proteins and negative ones. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.3 The embedding of sequences into vectors, the embedding takes both the token and the position into account. . . . . . . . . . . . . . . . . 13 3.4 The pre-training step. Every sequence gets modified by masking or changing some of the input tokens, the model is then tasked to predict the original token. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 xv List of Figures 3.5 Pre-training runs performed whilst varying a single parameter, these runs ended up taking on the order of 1/20th the time it took for the final pre-training. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.6 The loss function as it evolves over time during the pre-training, the shift from periodic to non periodic behavior around batch 300000 is due to a change from fixed to random batch order. . . . . . . . . . . 16 3.7 A graphic representation of the fine-tuning procedure. . . . . . . . . . 17 3.8 The loss and accuracy of the fine-tuning. . . . . . . . . . . . . . . . . 17 4.1 a) ROC curve comparing the predictive capabilities of the BERT and HMM models. The curve shown is averaged across all sequence lengths (10-63). b) The area under the ROC curve as a function of the protein sequence length. . . . . . . . . . . . . . . . . . . . . . . . 20 4.2 A variety of metrics comparing the BERT and HMM models for dif- ferent sequence lengths. The optimal cutoff threshold used was ob- tained by selecting the threshold corresponding to the point of the ROC curve (Fig. 4.1) closest to the upper left corner. . . . . . . . . . 21 4.3 The figures show the distribution of proteins in the 2D space of BERT and HMM scores. The dashed lines correspond to the cutoff thresh- olds of the two models, and the blue and red distributions correspond to the negative and positive proteins respectively. All figures are shown in the same linear scale. a) Shows sequence lengths between 10 and 17, b) shows between 18 and 35, c) shows between 36 and 53, and d) shows for all sequence lengths. . . . . . . . . . . . . . . . . . 22 4.4 The corresponding 1D distributions to those shown in Fig. 4.3. The left column shows the distributions along the HMM axis and the right shows them along the BERT axis. As in the previous figure, a) shows the short sequences, b) the medium ones, c) the long ones, and d) shows all lengths. Again, the red distribution shows the positive proteins whilst the blue shows the negative proteins. . . . . . . . . . . 23 4.5 The degree of separation between the two distributions in the sense of measuring the area of overlap between the two distributions. For the case of a perfect classifier the separation would be 1, meaning no overlap, whilst a separation of 0 means that the two distributions overlap completely. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
24 4.6 a) ROC showing the performance of the BERT and HMM models when the proteins have had random errors introduced. The ROC shown has been averaged across all chosen error rates. b) The area under the ROC for different error rates. . . . . . . . . . . . . . . . . . 24 4.7 Various metrics to compare the BERT and HMM model when oper- ating on proteins with introduced read errors. . . . . . . . . . . . . . 25 4.8 ROC showing the difference between the standard model and the cross-validated one. The left figure shows the ROC curves for se- quences of various amino acid lengths, and the right figure shows the area beneath the curves to show how the models perform for various sequence lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 xvi List of Figures 4.9 Various metrics comparing the standard and cross-validated versions of the BERT model. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.10 The graphs show how important each part of the protein is in or- der to classify it as antibiotic resistant. Each protein shown is a known antibiotic resistant protein. The dashed line shows the opti- mal threshold, and the graphs for each model are aligned such that their thresholds overlap. . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.11 PCA plots of the [CLS] token output of the BERT model. The left column shows the two distributions (red for positive and blue for negative proteins) when the model is correct, whilst the right column shows when the model is incorrect. The top row shows the BERT models classification and the bottom row shows the same for the HMM model. All plots are made in the same linear scale. . . . . . . . 29 4.12 The corresponding distributions to the 2D distributions in Fig. 4.11. . 30 4.13 Graph showing the model output on unlabeled metagenomic reads. The insert shows the sequence length distribution. The dashed lines shows the cutoff thresholds obtained previously. The samples shown are three different metegenomic samples from the human gut. The vast majority of reads scored below the HMM cutoff, these are not included in the figure . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.14 More metegenomic reads, this time from waste water treatment plants from various countries, a) is from India, b) Senegal, and c) Sweden. . 32 xvii List of Figures xviii List of Tables 3.1 The selected model parameters. . . . . . . . . . . . . . . . . . . . . . 14 xix List of Tables xx 1 Introduction According to the World Health Organization, "antibiotic resistance is one of the biggest threats to global health, food security, and development today" [2]. Further, the American government agency Centers for Disease Control and Prevention at- tribute nearly 50000 preventable deaths due to antibiotic resistance in 2019 [3]. We can extrapolate that figure to encompass the entire world and arrive at 1.2 million preventable deaths which agrees with values found by other studies [4]. Comparing this to the most common causes of death, such as stroke with roughly 6 million deaths per year, we find that it is within the same order of magnitude [5]. With this in mind it should be clear why any kind of research into antibiotic resistance is of value. A particular problem when it comes to researching antibiotic resistance lies in ac- quiring samples of the bacterial genes that give rise to antibiotic resistance. 
The issue is twofold: firstly, it is not clear which parts of a bacterium's DNA give rise to its antibiotic resistance; secondly, we wish to identify these genes before the corresponding bacteria become so widespread that identification becomes trivial. There exist several traditional methods that perform this, but these usually depend on performing sequence alignments, which limits the kinds of genes they can discover.

In recent times, the application of various AI methods within the field of molecular biology has become both popular and widely successful. These methods have shown the capability of performing feats previously thought computationally nearly impossible, such as AlphaFold's capabilities within protein folding [6]. They are also known for their ability to learn general truths about the data which cannot be neatly written down in simple formulas. Thus, it seems prudent to see if any such AI method is capable of identifying antibiotic resistance genes.

However, AI methods generally require large data sets to train. This is a big problem, as we lack such large data sets, and it is precisely for this reason that we wish to utilize AI methods; we wish to create these data sets by identifying new DNA sequences. Fortunately, there exist models which are capable of pre-training on unlabelled data and which only require a small amount of subject-specific data for the final fine-tuning.

A suitable class of AI models for our purposes is language models. These models are designed to take a sentence as input and perform some kind of prediction. They are suitable because one can imagine the similarities between sentences and the grammar of language on the one hand, and amino acid sequences and the grammar of proteins on the other. For our case we have elected to use a recent language model known as the Bidirectional Encoder Representations from Transformers model (BERT) [11]. This model also offers the greatly needed capability of being pre-trainable on unlabelled data.

The purpose of this project is then to determine whether such an AI model is a suitable choice for the identification of antibiotic resistance genes in microbial DNA. We will compare its performance to that of a traditional method currently used by our supervisor, known as a Hidden Markov Model. If the model proves successful, the hope is that our supervisor will be able to use said model in future work.

2 Theory

2.1 Protein Synthesis and DNA Sequencing

In this work we mainly focus on the computational aspect of detecting certain proteins; thus, area-specific knowledge regarding protein structures and the like is not required to understand this thesis, and we will only cover the very basics required to understand this work. Proteins perform a vast array of different tasks, such as catalysing reactions, transporting molecules, giving structure to cells, and much more. The function of a protein is determined by its structure, which in turn is coded by DNA.

Proteins are created in a process known as protein synthesis. This process begins when an RNA polymerase enzyme connects with the DNA at a transcription start site. Here it begins by unraveling the two strands making up the DNA helix and begins reading and transcribing the DNA onto a new strand known as mRNA. It continues doing this until it hits a stop codon, signaling that it has transcribed a complete subunit of the DNA.
The process continues later on once the mRNA has made contact with a ribosome. Here the mRNA strand is read, and each codon corresponds to one amino acid, which is brought in by the tRNA. The ribosome connects each amino acid in order, building up the protein sequentially. This process can be seen in Fig. 2.1.

Figure 2.1: Graphic representing protein synthesis.

The most important part of the above process for our use case is that proteins are coded into DNA in enclosed segments, separated from the rest of the DNA by two stop codons. Thus, if we wish to study specific proteins, it is possible to do so by looking at the underlying DNA which constructs the proteins. This has been quite successful in recent years, with new AI techniques such as AlphaFold able to reconstruct the protein structure from the amino acid sequence [6]. In order to do this, however, we must also translate the DNA sequence into the corresponding amino acid sequence. Thankfully, this translation is known, and we can simply look it up in a table such as Fig. 2.2.

Figure 2.2: Translation of codons into amino acids. To arrive at the corresponding amino acid, begin at the center of the figure with the first nucleic acid of the codon, then follow the graphic outwards.

Unfortunately, since DNA exists in the real world, we must also contend with the practical issue of actually sequencing the DNA. There exist several methods that achieve this, but one thing these methods all share in common is that they introduce errors in the form of read errors. These errors are usually manageable; for example, the popular method of Illumina sequencing tends to produce error rates below 1% [9]. We must note, however, that these error rates pertain to the nucleic acids and not the amino acids. For our use case, a single nucleic acid being read incorrectly is enough to change an amino acid into another. Thus we find that the probability of all nucleic acids in a codon being read correctly is 0.99³ ≈ 0.97, giving us an error rate of roughly 3% (a small code sketch of this translation and error arithmetic is given at the end of Section 2.2). Because of this it would be prudent to keep in mind that any method proposed to be used in combination with proteins should be able to contend with such errors.

2.2 Protein Function

There exist several mechanisms by which bacteria gain antibiotic resistance. Some of these include the bacteria developing new processes which avoid using the antibiotic's targets, creating enzymes that break down the antibiotic, changing the entryways into the bacteria to stop the antibiotic from entering the cell, and even pumping the antibiotics out from the cell [7]. Of these mechanisms we only tackle the creation of enzymes, since this only requires a single gene.

Figure 2.3: Figure demonstrating a protein's structure and its active site. The figure text font has been altered from the original, and a link to the original can be found in [8].

As with all proteins, it is the structure of enzymes that determines their function. Even more specifically for enzymes, it is the active site that determines what happens when an enzyme interacts with other molecules. So long as the active site retains the same or an equivalent shape, and the rest of the enzyme does not block the target molecule, the enzyme will continue to function even if the other parts of the enzyme are changed or mutated. This can be seen in Fig. 2.3.
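Before moving on, the following is a minimal Python sketch of the codon translation and read-error arithmetic described in Section 2.1 above. The codon table is only a small excerpt of the standard table, the DNA read is a made-up example, and the snippet is purely illustrative; it is not part of the pipeline used in this thesis.

```python
# Minimal sketch: translating a DNA read into amino acids and estimating the
# per-amino-acid error rate from a per-nucleotide read error rate.
# The codon table below is only a small excerpt of the standard table.
CODON_TABLE = {
    "ATG": "M", "TTT": "F", "TTC": "F", "AAA": "K", "AAG": "K",
    "GGC": "G", "GGA": "G", "TAA": "*", "TAG": "*", "TGA": "*",  # '*' = stop
}

def translate(dna: str) -> str:
    """Translate a DNA read codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")   # 'X' marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGTTTAAAGGCTAA"))   # -> "MFKG" (hypothetical read)

# A single misread nucleotide is enough to change the amino acid, so the
# chance that a whole codon is read correctly is (1 - p)^3.
p_nucleotide = 0.01
p_codon_ok = (1 - p_nucleotide) ** 3
print(f"amino acid error rate ~ {1 - p_codon_ok:.3f}")   # ~ 0.03
```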
2.3 The Model

2.3.1 Model Selection

To determine which AI model to apply, we must first understand our data. The data that we have consists of sequences of amino acids which represent our proteins. We know from earlier that one of the most important parts of the protein is the active site itself; therefore our model should somehow try to capture these features in the data. A naive method could try to identify the parts of the sequence that correspond to the active site. The issue with such a method lies in the folding of proteins upon their creation. Because the proteins are folded in 3D space, the distance between amino acids in the 1D sequence does not correspond to the actual distance between them in 3D. Thus we expect the amino acids that make up the active site to be separated in the sequence. Even further, it is possible to construct the same or a similar active site with wildly varying amino acid sequences. A toy demonstration of this can be seen in Fig. 2.4. This informs us that our model needs to be able to take the entire sequence into account at once. Thus we can immediately rule out any method that traverses the sequence, as these cannot capture relations between amino acids that are seemingly distant in the 1D sequence. Another issue to take into consideration is that the length of the sequences also varies; the model must be able to handle this as well.

Figure 2.4: Figure demonstrating how different amino acid sequences can be folded to have similar shapes. In the figure, the blue blocks represent the amino acids that are part of the active site. The 'folding' in the figure consists of taking segments of 3 amino acids and rotating them freely to achieve the same result.

Taking these thoughts into consideration, we turn to language models. These models are built to understand human language. It might seem odd at first glance, but sentences share a lot in common with our amino acid sequences. They both contain tokens (words or amino acids) which are in some way connected to other tokens, where the distance between the tokens in the sequence is not necessarily important, and they can both vary in length. Further, these models are often used for sentiment analysis, that is, classifying sentences into two or more groups, which is similar to classifying proteins as antibiotic resistant or not. From this it should be clear why a language model would make a good candidate for dealing with proteins. Another aspect we should take into consideration is that the set of known antibiotic resistance genes within our family of proteins is rather small. Therefore, we require a model that somehow can contend with small data sets.

A quick search on language AI models applied to DNA or proteins yields several methods, but the most commonly represented one is the relatively new BERT model [11]. The BERT model is a natural language processing model developed by Google in 2018. This model is essentially a normal transformer but with a specific training regimen. The model meets our stated targets by employing what is known as self-attention. This mechanism in essence multiplies the input with its own transpose, creating a score between each pair of tokens in the input. This means that the model is distance agnostic and able to see distant connections in the input data. To account for the varying sequence length, the model has a maximum input length and employs padding to fill out all inputs.
This is of course one downside which we must accept, since we cannot use methods that traverse the sequence. Finally, the BERT model is also trained in two steps: a pre-training step where it trains on unlabelled data to learn the grammar of language (or proteins in our case), and a fine-tuning step where we introduce the subject-specific data to classify. The idea behind this is that with the pre-training, the model will learn general ideas about the language which help it with classifying later on. In layman's terms, this entails that it requires less labelled training data. Since this model is otherwise highly acclaimed and suitable for our needs, we elected to go forward using it.

2.3.2 Transformers

The most integral part of the BERT model is the transformer. This machine learning model was first introduced in the paper Attention Is All You Need in 2017 [10]. The original paper does a good job of introducing and explaining the model, and for a deeper look into how it works it is a recommended read. For our case, we will lightly go over the model and its components in order to name and give an understanding of the different model parameters which we have to select and contend with.

To begin, let us view a single Encoder; this can be seen in Fig. 2.5. The input to the Encoder is copied three times, creating the Keys, Values, and Queries, which are fed into the Multi-Head Attention. The output of this layer is then added to the original input and then layer normalized. The output of this operation is then fed to a Feed Forward Network, which also performs a similar residual connection and layer normalization. Since we keep a residual connection, the output of each sub-unit must be the same size as the input, which is the Hidden Dimension Size, or d_model. We will also note that we make use of dropout layers before each residual addition, but these are not shown in the figure. This gives, for each sublayer, the equation

LayerNorm(x + Dropout(Sublayer(x))).

Figure 2.5: A single Encoder block. The Encoder consists of two sub-units, the Multi-Head Attention and the Feed Forward network. Each sub-unit's output is subjected to dropout (not shown), added to the original input, and then layer normalized before being fed forward.

The most important part of the Encoder is the Multi-Head Attention block, which can be seen in Fig. 2.6. In the figure, each head corresponds to one of the several partially see-through layers, and the number of heads is denoted by h. The Keys, Values, and Queries are all fed into individual linear layers which project the inputs into the Self-Attention Internal Dimension Size, d_self-attention, before being input into the Scaled Dot-Product Attention. This block can be seen to the left in the figure, in which we perform a series of simple operations. In essence, the Keys and Queries are used to create an attention matrix which denotes how important each token in the input is to every other. Finally, this attention matrix is subjected to dropout before being multiplied with the unchanged Values input to produce the output. After this, the outputs of all heads are concatenated and fed into a linear layer which projects the dimension back down to the original input dimension.

Figure 2.6: The components within the Encoder. Not shown in the figure is the dropout that occurs on the output of the softmax layer before the final matrix multiplication in the Scaled Dot-Product Attention.
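As an illustration of the operations described above, the following PyTorch sketch implements a Multi-Head Attention block and a single Encoder with the LayerNorm(x + Dropout(Sublayer(x))) structure. The default dimension values mirror the parameter sizes later listed in Table 3.1, the ReLU activation in the feed-forward network is an assumption, and the padding mask is omitted for brevity; this is a simplified sketch, not the implementation used in the thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention with h parallel heads, as in Fig. 2.6."""
    def __init__(self, d_model=512, n_heads=8, d_attn=64, p_drop=0.1):
        super().__init__()
        self.n_heads, self.d_attn = n_heads, d_attn
        # One linear projection per input (queries, keys, values), all heads at once.
        self.q_proj = nn.Linear(d_model, n_heads * d_attn)
        self.k_proj = nn.Linear(d_model, n_heads * d_attn)
        self.v_proj = nn.Linear(d_model, n_heads * d_attn)
        self.out_proj = nn.Linear(n_heads * d_attn, d_model)  # back to d_model
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        def split(t):                          # -> (batch, heads, seq_len, d_attn)
            return t.view(b, n, self.n_heads, self.d_attn).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        # Attention matrix: a score between every pair of tokens in the input.
        scores = q @ k.transpose(-2, -1) / self.d_attn ** 0.5
        attn = self.drop(F.softmax(scores, dim=-1))     # dropout on the softmax output
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out_proj(out)

class EncoderBlock(nn.Module):
    """One Encoder: LayerNorm(x + Dropout(Sublayer(x))) around each sub-unit."""
    def __init__(self, d_model=512, n_heads=8, d_attn=64, d_ffn=64, p_drop=0.1):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, n_heads, d_attn, p_drop)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),   # activation assumed
                                 nn.Linear(d_ffn, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):
        x = self.norm1(x + self.drop(self.attn(x)))
        return self.norm2(x + self.drop(self.ffn(x)))

x = torch.randn(2, 64, 512)        # (batch, l_max, d_model)
print(EncoderBlock()(x).shape)     # torch.Size([2, 64, 512])
```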
With all of these various parts, it is a good idea to clarify the parameters. The Hidden Dimension Size, or d_model, is the embedding size of the tokens and the consistent dimension to which the data is projected through a linear layer after each sublayer in the model. The Maximum Sequence Length, or l_max, is the longest sequence which we can feed the model, and since one of these positions will always be taken by the [CLS] token, the actual maximum length for the amino acid sequences is 63. The Number of Attention Heads, or h, is the number of parallel layers within the Multi-Head Attention sublayer. The Self-Attention Internal Size, or d_self-attention, is the dimension to which the first linear layers inside the Multi-Head Attention sublayer project the input data. The FFN Internal Size, or d_FFN, is the size to which the first linear layer in the Feed Forward Network in the Encoder projects the input. The Encoder Stack Depth, or d_depth, is the number of Encoders we stack together to create the BERT model. The Dropout Rate, or r_drop, is the rate of inputs that are ignored in the dropout layers.

3 Methods

3.1 The Data Sets

In this work we utilize different data sets for different parts of the training; thus we wish to elucidate these and present a visual picture to help with understanding. The primary data set consists of proteins belonging to the beta-lactamase/transpeptidase-like superfamily of proteins. In Fig. 3.1, we have the set of all bacterial DNA in a, within this set we have our family of interest in b, within that we have a set of HMM labelled data in c, and finally we have our true labelled data in d. We note that the true labelled data may have some overlap with the HMM labelled data. If we wished to pre-train a general BERT for proteins, we would pre-train it on the outer group a. However, since we wish to operate exclusively on our specific family of possibly antibiotic resistance related DNA, we can gain some extra performance by restricting our set to only b. For the fine-tuning part, if we had enough true labelled data, we would make exclusive use of d. This is unfortunately not the case; thus we fine-tune on c and use d as validation data.

Figure 3.1: Visual representation of the various data sets used in this work. a represents all bacterial DNA, b represents our specific family of bacteria, c represents DNA within this family which has been labelled by using an HMM model, and d represents our set of true labelled DNA.

Another thing to note is that the reads in our data sets are generally not complete reads of the entire DNA sequence making up the protein, but snippets of them. Our reads tend to have lengths between 20 and 50 amino acids, whilst the complete sequences tend to be around 300 amino acids long.

Later on we also wish to perform a cross-validation of the BERT model. We do this since it is possible that the HMM labelled data is highly correlated with the true labelled data, since the HMM model is built using the true data. To determine whether this is unfairly allowing the BERT model to somehow train on the validation data through the HMM data, we must separate these dependencies and cross-validate the model. We do this by combining all training and validation data into one set; we then run this set through a clustering algorithm (UCLUST) with an identity threshold of 80% [12]. We then separate these clusters into six individual folds in a stratified manner; that is, we try to keep the ratio between positive and negative proteins, and between originally training and originally validation data, consistent between the folds.
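As a rough sketch of how such a stratified, cluster-respecting split might be produced, the snippet below distributes whole clusters over six folds while balancing label and data origin. The input format for the clusters (built, for example, from UCLUST output) is hypothetical, and the exact procedure used in the thesis may differ.

```python
import random
from collections import defaultdict

def stratified_cluster_folds(clusters, n_folds=6, seed=0):
    """Assign whole clusters to folds so that no cluster is split across folds.

    `clusters` is a list of dicts like
    {"ids": [...], "positive": bool, "from_validation": bool}
    (a hypothetical format derived from the clustering output)."""
    rng = random.Random(seed)
    # Group clusters by (label, origin) so each stratum is spread over all folds.
    strata = defaultdict(list)
    for c in clusters:
        strata[(c["positive"], c["from_validation"])].append(c)
    folds = [[] for _ in range(n_folds)]
    for stratum in strata.values():
        rng.shuffle(stratum)
        # Deal clusters round-robin, largest first, to keep fold sizes balanced.
        for i, c in enumerate(sorted(stratum, key=lambda c: -len(c["ids"]))):
            folds[i % n_folds].extend(c["ids"])
    return folds

# toy example with hypothetical sequence ids
toy = [{"ids": [f"seq{i}"], "positive": i % 2 == 0, "from_validation": i < 4}
       for i in range(12)]
print([len(f) for f in stratified_cluster_folds(toy)])
```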
A graphic representation of this grouping can be seen in Fig. 3.2.

Figure 3.2: Visual representation of the grouping of data to perform the cross-validation. The dark orange blocks represent the original validation data, and the yellow blocks represent the original training data. The pluses and minuses represent the sets of antibiotic resistance positive and negative proteins.

Finally, we will also test the model on a set of metagenomic reads from wastewater treatment plants located in Sweden, India, and Senegal, and a set of similar reads from the human gut.

3.2 The BERT Model

To construct the model, we stack a set of Encoders, as described in the Theory chapter, on top of one another. We then add a tokenization step at the input of the model. The tokenization is described by Fig. 3.3. We construct a vocabulary of accepted inputs of the model as the set of all amino acids and the special tokens [CLS], [PAD], and [MSK]. In order to transform these tags into vectors with which the model can operate, we construct a vector representation for each token. This vector representation is simple in the sense that we just assign each token a vector in some space such that the vectors are evenly distributed. We also assign vectors to all possible positions in the input sequence. The final vector representation of a specific token at a specific position in the input is then obtained as the sum of the token and position embeddings.

Figure 3.3: The embedding of sequences into vectors; the embedding takes both the token and the position into account.

Once the model has been constructed, the next step is pre-training. The pre-training consists of taking unlabeled proteins, concealing or randomizing some of the tokens, and then asking the model to predict what these tokens should be. The advantage of doing this is that it allows the model to learn a general understanding of proteins without requiring the use of labeled data. This is crucial, as it will decrease the amount of labeled data we need in the fine-tuning, of which we have very little, only around 200 sequences. How this is done can be seen in Fig. 3.4.

Figure 3.4: The pre-training step. Every sequence gets modified by masking or changing some of the input tokens; the model is then tasked to predict the original token.

For each sequence we select 15% of tokens to be modified. Of these 15%, 80% get replaced with a [MSK] token, 10% get replaced with a new random amino acid, and the remaining 10% are left unchanged. We then feed these sequences through the model and feed only the outputs corresponding to the modified tokens into a linear layer which classifies the output into one of the 22 amino acids. One important thing to note is that these modifications are done every time a protein is presented to the model; thus we do not expect the model to suffer from overfitting from this step, as the model nearly never sees the same input twice.
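A minimal sketch of this masking scheme is given below. The thesis vocabulary contains 22 amino acid tokens; for simplicity the sketch uses the 20 standard amino acids, and the representation of tokens as plain strings is purely illustrative.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")   # the 20 standard amino acids

def mask_for_pretraining(tokens, rng=random):
    """BERT-style masking: each position is selected with probability 0.15;
    of the selected positions, 80% become [MSK], 10% become a random amino
    acid, and 10% stay unchanged. Returns the corrupted sequence and the
    positions (with their original tokens) that the model must predict."""
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if tok == "[CLS]" or rng.random() >= 0.15:
            continue
        targets.append((i, tok))              # the loss is computed only here
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MSK]"
        elif r < 0.9:
            corrupted[i] = rng.choice(AMINO_ACIDS)
        # else: leave the token unchanged
    return corrupted, targets

seq = ["[CLS]"] + list("MSSSFFIPSQILLSLLA")   # toy sequence, as in Fig. 3.3
corrupted, targets = mask_for_pretraining(seq, random.Random(1))
print("".join(t if len(t) == 1 else "_" for t in corrupted[1:]))
print(targets)
```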
We train the model until the loss enters a steady state. We also keep a set of validation data as a sanity check, but strictly speaking this should be unnecessary.

Table 3.1: The selected model parameters.

    Model Feature                   Value
    Hidden Dimension Size           512
    Maximum Sequence Length         64
    Number of Attention Heads       8
    Self-Attention Internal Size    64
    FFN Internal Size               64
    Encoder Stack Depth             8
    Dropout Rate                    0.1

Since the pre-training step takes a lot of time and computational power, we must perform the parameter selection at this step. We selected our initial parameters as roughly 80% of the parameters of the original BERT model [11]; the selected values can be seen in Table 3.1. We then ran relatively short pre-training runs (on the order of 1/20th of the time the full pre-training would end up taking) whilst varying a single parameter. The result of these runs can be seen in Fig. 3.5. We wished to limit the total number of parameters so as not to slow down the model too much; thus we ended up selecting the higher values for the hidden dimension size and the self-attention internal dimension size, but we kept the encoder stack depth at 8.

Figure 3.5: Pre-training runs performed whilst varying a single parameter: a) the hidden dimension size, b) the self-attention internal dimension size, and c) the encoder stack depth. These runs ended up taking on the order of 1/20th of the time it took for the final pre-training.

The full pre-training was run on an Nvidia GeForce RTX 2080 Ti graphics card and took roughly 30 hours of computation time to complete. The evolution of the loss function can be seen in Fig. 3.6. We note that the sudden shift from periodic behaviour to the smoother line at around batch 300000 is due to a change from fixed to random batch order. The corresponding accuracy of the pre-training ended up at roughly 43%; that is, the model was capable of guessing correctly what amino acid had been either masked, changed, or left unchanged 43% of the time.

Figure 3.6: The loss function as it evolves over time during the pre-training; the shift from periodic to non-periodic behavior around batch 300000 is due to a change from fixed to random batch order.

With the pre-training done, we now turn to the fine-tuning. The fine-tuning is comparably an easier step now that all the heavy lifting has been performed by the pre-training. We simply feed an unmodified sequence into the model and pass the [CLS] token output into a linear model to classify it as either antibiotic resistant or not. We also introduce some noise into the training data by randomly taking a subsequence of each sequence with a random length between 10 and 63. This does two things: it helps the model later on when we wish to predict on short sequences, and it also helps fight overfitting. A graphic of the fine-tuning can be seen in Fig. 3.7. Since our version of BERT does not contain any Next Sentence Prediction or similar step in the pre-training, as described in the original BERT paper [11], our [CLS] output is untrained and only has access to the learned attention at each Encoder step. This entails that we cannot freeze the parameters of the BERT model during our fine-tuning, as many schemes using a pre-trained BERT model usually do.
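The following sketch shows how such a fine-tuning head could look on top of the pre-trained encoder: the [CLS] output vector is fed to a single linear layer giving resistant/not-resistant logits, and the whole network, encoder included, is left trainable. The `bert` module and the toy embedding standing in for it are placeholders, not the thesis implementation.

```python
import torch
import torch.nn as nn

class ResistanceClassifier(nn.Module):
    """Fine-tuning head: the [CLS] output vector is passed through a single
    linear layer to give resistant / not-resistant logits (cf. Fig. 3.7)."""
    def __init__(self, bert: nn.Module, d_model: int = 512):
        super().__init__()
        self.bert = bert                       # left trainable: nothing is frozen
        self.head = nn.Linear(d_model, 2)

    def forward(self, token_ids):              # token_ids: (batch, l_max)
        hidden = self.bert(token_ids)           # (batch, l_max, d_model)
        cls_vec = hidden[:, 0, :]                # position 0 holds the [CLS] token
        return self.head(cls_vec)                # (batch, 2) classification logits

# toy stand-in for the pre-trained encoder, just to make the sketch runnable;
# the vocabulary of roughly 25 tokens is 22 amino acids plus [CLS], [PAD], [MSK]
toy_bert = nn.Sequential(nn.Embedding(25, 512))
model = ResistanceClassifier(toy_bert)
logits = model(torch.randint(0, 25, (4, 64)))
print(logits.shape)                              # torch.Size([4, 2])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([1, 0, 1, 0]))
```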
Even now, with the pre-training that drastically lowers the amount of required labelled data, we still did not have enough data to fully train the BERT model. We circumvented this issue by using an HMM model to create a systematically labelled data set, and then we used our true labelled data set as our validation set.

Figure 3.7: A graphic representation of the fine-tuning procedure.

Looking at Fig. 3.8, we see that the fine-tuning converges orders of magnitude faster compared to the pre-training. Some back-of-the-envelope calculations suggest that if we had not made use of the HMM labelled data, we would require about an order of magnitude more true labelled data than we have.

Figure 3.8: The loss and accuracy of the fine-tuning.

4 Results

4.1 Comparing with Previous Methods

To evaluate the performance of the BERT model, we compare it to an HMM model across various metrics. To start off, we wish to note that the HMM model only produces positive scores of how likely it believes a protein is to be antibiotic resistant. The model also discards most proteins that fall beneath a specific score threshold; this is done to improve the time performance of the model. This all entails that proteins which we know do not exhibit antibiotic resistance may fail to show up in the output of the HMM model. To account for this we have elected to reintroduce all discarded proteins by assigning them a uniform random HMM score between 0 and the lower threshold score of the HMM model.

To begin with, we study the receiver operating characteristic (ROC) curves as seen in Fig. 4.1 a). This figure shows the ability of the two models to classify whether a protein is antibiotic resistant or not. For the case of a perfect classifier, the ROC curve would connect the two ends of the dashed line via the upper left corner, and a classifier that randomly guesses would correspond to a ROC curve that follows the dashed line. In the ROC curves shown we have not differentiated the data with respect to the sequence length of the proteins; thus the curves are averaged across all sequence lengths (10-63). We see that the BERT model outperforms the HMM model. In the second figure, Fig. 4.1 b), we see the corresponding area under the ROC curves. In these graphs we have not averaged across the sequence length and can thus see its effect. A perfect classifier would have an ROC area of 1 whilst a random classifier would have 0.5, since it would be the area beneath the dashed line in Fig. 4.1 a). We see that for short sequence lengths the HMM model is essentially guessing, whilst the BERT model performs admirably with an area around 0.9. We see that for longer sequence lengths both models perform similarly.

Figure 4.1: a) ROC curve comparing the predictive capabilities of the BERT and HMM models. The curve shown is averaged across all sequence lengths (10-63). b) The area under the ROC curve as a function of the protein sequence length.
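The ROC curves and the per-length areas under them can be computed along the following lines, here sketched with scikit-learn on randomly generated stand-in scores; the actual scores, labels, and length bins used in the thesis are of course different.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical arrays: one score per validation protein, the true labels,
# and the sequence length of each protein.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 1000)
bert_scores = np.clip(labels * 0.6 + rng.normal(0.3, 0.2, 1000), 0, 1)
lengths = rng.integers(10, 64, 1000)

# ROC curve averaged over all lengths (as in Fig. 4.1 a) ...
fpr, tpr, thresholds = roc_curve(labels, bert_scores)
print(len(fpr), "points on the ROC curve")
print("overall AUC:", roc_auc_score(labels, bert_scores))

# ... and the area under the ROC per sequence-length bin (as in Fig. 4.1 b).
for lo, hi in [(10, 18), (18, 36), (36, 64)]:
    m = (lengths >= lo) & (lengths < hi)
    print(f"lengths {lo}-{hi - 1}: AUC = {roc_auc_score(labels[m], bert_scores[m]):.3f}")
```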
Moving on, we are also interested in more specific metrics with which to compare the models; these can be seen in Fig. 4.2. These figures were produced by selecting a single cutoff threshold to be used for all sequence lengths. This was done since this is how an HMM model is normally used; we do not vary the cutoff threshold with the sequence length of the proteins. This threshold was selected by taking the threshold which corresponds to the point on the ROC curve in Fig. 4.1 a) closest to the upper left corner. The effects of our choices can be seen rather starkly in the figure, as the apparent performance of the HMM model tapers off after a sequence length of 45. It is not the case that the HMM model actually performs worse on longer sequence lengths; we have already seen in the previous figure, Fig. 4.1 b), that it does not. Instead, what this shows us is that although the HMM model does separate the two distributions (positive and negative proteins with regard to antibiotic resistance), its capability is very dependent on the choice of cutoff threshold, and it suffers here because we select the best cutoff threshold averaged across all sequence lengths. On the other hand, we do not see this with the BERT model, implying that it separates the two classes in a way that is less sensitive to the choice of threshold.

Taking a look at the actual curves in Fig. 4.2, we again see that the HMM model is essentially guessing when it comes to proteins with a short sequence length. In the lower figure we see that for short sequences the HMM model almost always guesses that the protein is negative. This is likely due to the fact, as discussed in more detail in the beginning of this chapter, that we reintroduced sequences which the HMM model did not produce a score for with a low HMM score. Since the HMM model struggles with short sequences, most of these get reintroduced as negative proteins. We see that the BERT model struggles slightly with both the false positive rate and the false negative rate for short sequences, of which the false positive rate is of utmost importance. Still, it performs much better than the HMM model, which for all intents and purposes does not function for short sequence lengths.

Figure 4.2: A variety of metrics comparing the BERT and HMM models for different sequence lengths. The optimal cutoff threshold used was obtained by selecting the threshold corresponding to the point of the ROC curve (Fig. 4.1) closest to the upper left corner.

We can further study how well the two methods perform in separating the two distributions of positive and negative proteins by plotting the distributions against one another in a 2D distribution plot. These plots can be seen in Fig. 4.3. In the figure we have again separated the proteins by sequence length, where a) shows the short proteins, b) shows the medium length ones, c) shows the long proteins, and finally d) shows all proteins. We again see that for the short proteins the BERT model separates the two distributions well whereas the HMM model struggles, but for longer sequences this difference diminishes. We can now also see evidence of the earlier suggested instability in selecting the cutoff threshold for the HMM model. In each figure, it is clear that a small shift of the horizontal dashed line would cause a significant shift in the performance of the HMM model, as the negative distribution forms a sharp peak along the HMM axis.
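A sketch of how the single cutoff threshold and the metrics of Fig. 4.2 can be obtained is given below; it assumes score and label arrays like those in the previous sketch and picks the ROC point closest to the upper left corner, as described in the text. It is an illustration of the procedure, not the exact code used in the thesis.

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(labels, scores):
    """Threshold whose ROC point lies closest to the upper left corner."""
    fpr, tpr, thr = roc_curve(labels, scores)
    return thr[np.argmin(np.hypot(fpr, 1 - tpr))]

def metrics_at(labels, scores, thr):
    """Sensitivity, specificity, precision, and accuracy at a fixed cutoff."""
    pred = scores >= thr
    tp = np.sum(pred & (labels == 1)); tn = np.sum(~pred & (labels == 0))
    fp = np.sum(pred & (labels == 0)); fn = np.sum(~pred & (labels == 1))
    return {"sensitivity": tp / (tp + fn), "specificity": tn / (tn + fp),
            "precision": tp / (tp + fp), "accuracy": (tp + tn) / labels.size}

# hypothetical scores, purely for illustration
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 500)
scores = np.clip(0.5 * labels + rng.normal(0.25, 0.2, 500), 0, 1)
thr = optimal_threshold(labels, scores)
print(thr, metrics_at(labels, scores, thr))
```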
Figure 4.3: The figures show the distribution of proteins in the 2D space of BERT and HMM scores. The dashed lines correspond to the cutoff thresholds of the two models, and the blue and red distributions correspond to the negative and positive proteins respectively. All figures are shown in the same linear scale. a) Shows sequence lengths between 10 and 17, b) shows between 18 and 35, c) shows between 36 and 53, and d) shows all sequence lengths.

To get a clearer view of what happens where the two distributions overlap, we also study the corresponding 1D distributions along each axis; these can be seen in Fig. 4.4. We see that in row a), which corresponds to the short proteins, the HMM model fails to separate a large share of the positive proteins from the negative ones. The same might also be said for the BERT model, but its share of overlap is much smaller. In row b) we see that the HMM model still struggles to separate the two groups whilst the BERT model does so almost perfectly. Row c) shows again that for the long sequences there is little difference between the two models.

Figure 4.4: The corresponding 1D distributions to those shown in Fig. 4.3. The left column shows the distributions along the HMM axis and the right shows them along the BERT axis. As in the previous figure, a) shows the short sequences, b) the medium ones, c) the long ones, and d) shows all lengths. Again, the red distribution shows the positive proteins whilst the blue shows the negative proteins.

To get an understanding of how well the two models separate the distributions, we can measure the overlapping area between them and compare between the models; this can be seen in Fig. 4.5. In this figure, a separation of 1 means that the two distributions have no overlapping area, whilst 0 means that they overlap completely. We see that for very short sequence lengths (<15), the HMM model distributions have half or more of their area overlapping. Further, we see that the BERT model performs as well with sequence lengths of 10 as the HMM model does with sequence lengths of 18. The two models equalize at roughly sequence lengths of 25, after which the difference is too small to confidently make any judgment.

Figure 4.5: The degree of separation between the two distributions, in the sense of measuring the area of overlap between the two distributions. For the case of a perfect classifier the separation would be 1, meaning no overlap, whilst a separation of 0 means that the two distributions overlap completely.

Since the HMM model is based upon, amongst other things, sequence alignment, it was expected to be highly resistant towards read errors, that is, random errors in which a set of amino acids are read incorrectly and reported as another amino acid. These errors are expected and occur with varying error rates depending on the sequencing method; thus our models must be resistant towards such errors. To determine whether this was the case, we elected to test both models upon our validation data, but where we randomly introduced read errors into said data at varying rates. The scheme used to do this is similar to the pre-training step, with the only difference being that we only made use of randomly changing amino acids.
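The error-injection scheme can be sketched as follows: each amino acid is independently replaced by a different, random one with the chosen error rate. The example sequence is made up, and the snippet is only an illustration of the procedure described above.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def add_read_errors(seq, error_rate, rng=random):
    """Replace each amino acid with a different, random one with probability
    `error_rate`, mimicking sequencing read errors on the protein level."""
    out = []
    for aa in seq:
        if rng.random() < error_rate:
            out.append(rng.choice([a for a in AMINO_ACIDS if a != aa]))
        else:
            out.append(aa)
    return "".join(out)

seq = ("MSSSFFIPSQILLSLLA" * 4)[:63]   # hypothetical full-length (63) sequence
print(add_read_errors(seq, 0.10, random.Random(0)))
```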
In order to only see the effect of the read errors, we chose to work only with full-length sequences, of length 63, since we have already seen the effect of sequence length on both models.

Figure 4.6: a) ROC showing the performance of the BERT and HMM models when the proteins have had random errors introduced. The ROC shown has been averaged across all chosen error rates. b) The area under the ROC for different error rates.

We can see in Fig. 4.6 that both models perform well for all error rates. The difference we see between the two models is too small to draw any conclusions from. We reach the same verdict when studying the metrics in Fig. 4.7.

Figure 4.7: Various metrics comparing the BERT and HMM models when operating on proteins with introduced read errors.

4.2 Testing Training Transferability

As stated earlier, we chose to restrict our BERT model to training upon a single superfamily of proteins in order to maximize performance. However, we believe that most of the information learned in the pre-training step is general to most proteins and thus transferable. To test this we selected another superfamily, known as methyltransferase, and redid the pre-training step upon this superfamily. However, instead of starting out with an uninitialised BERT model, we made use of the one already trained on the beta-lactamase superfamily. Doing this we found that the pre-training completed within less than a single epoch, and time-wise the training took about a minute compared to the original 30 hours. This tells us that the pre-training was indeed quite general.

In conjunction with the above, we also wished to test the BERT model's performance on a data set containing sequences from both superfamilies. The reason behind this is that a general BERT able to classify all protein families related to antibiotic resistance would be a much preferable model to have. To do this we created a balanced data set between the two families and performed the pre-training and fine-tuning on said set. As before, the pre-training took minutes, and the same was true of the fine-tuning. We found the final performance of the BERT model on this mixed data set to be slightly worse than when the model was trained on the two sets separately. When the model classified the two sets separately, the average accuracy on the validation data was around 99%; when combining the sets, the accuracy was around 98%. Thus we pay a small tax in accuracy to gain generality.

4.3 Cross-Validating the Model

Since we elected to fine-tune the model upon HMM labelled data, we had concerns that the model might be unfairly gaining insight into the validation data through the training data. What we mean by this is that the HMM model has, in essence, been 'trained' on our validation data. The HMM model works essentially by comparing new proteins with those in its bank.
This means that it is possible that the HMM labelled data contains proteins that are identical, or at least highly correlated, to the validation data. In turn this could imply that the BERT model is not learning anything general but simply learning the entire validation set. As stated in the Methods chapter, we avoid these issues by performing a cross-validation of the fine-tuning. How we separated the data into folds can also be seen in the Methods chapter. The cross-validation helps ensure that the correlation between the training and validation data does not over-inflate the performance of the BERT model. The comparison between the standard model and the cross-validated model can be seen in Fig. 4.8. Here we see that the BERT model has truly learnt something general about the data and not simply managed to recreate the validation data. The various metrics in Fig. 4.9 also lend credence to this conclusion.

Figure 4.8: ROC showing the difference between the standard model and the cross-validated one. The left figure shows the ROC curves for sequences of various amino acid lengths, and the right figure shows the area beneath the curves to show how the models perform for various sequence lengths.

Figure 4.9: Various metrics comparing the standard and cross-validated versions of the BERT model.

4.4 Timing the Model

It is quite difficult to compare the two models in terms of computation time. The reason for this is that the models run on different architectures: the HMM model is run on the CPU whilst the BERT model is run on a GPU. However, for the use case intended for the BERT model, we can still make a qualitative comparison. Running on a GeForce RTX 2080 Ti, the BERT model can process 10 million sequences in roughly 30 minutes. The same sequences take roughly 2 minutes for the HMM model running on 40 CPUs. This tells us that for our use case, the HMM model is an order of magnitude faster than the BERT model. This difference can be made even bigger if we slacken the HMM model somewhat, which entails a less detailed result but increases the difference to two orders of magnitude.

4.5 Glimpsing into the Model

In order to understand what the model has learned, let us take a look at which parts of a protein the model considers important when classifying a positive protein as such. We can do this by taking many snippets of a protein and running each snippet through our models; we then combine the scores for each snippet and thus get a score for each amino acid in the protein. We get the snippets by traversing the protein with varying window sizes. The result of such runs can be seen in Fig. 4.10. Here we have chosen three positive proteins to take a look at. What we can note from these graphs is that the first peaks in them tend to fall between amino acids 60 and 100, which is where we expect the active site to exist in our family of proteins [13]. Thus we can note that one possible thing the model has learned is how to identify such sites, but we will not delve deeper into this analysis.

Figure 4.10: The graphs show how important each part of the protein is in order to classify it as antibiotic resistant. Each protein shown is a known antibiotic resistant protein. The dashed line shows the optimal threshold, and the graphs for each model are aligned such that their thresholds overlap.
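One way to implement the snippet-scoring procedure described above is sketched below: overlapping windows of several sizes are scored and the scores are averaged back onto the residues they cover. The `score_fn` argument stands in for either model's positive score, and the toy scoring function and window sizes are purely hypothetical.

```python
import numpy as np

def residue_importance(sequence, score_fn, window_sizes=(20, 30, 40), stride=1):
    """Score overlapping snippets of a protein and average the scores back
    onto the residues they cover, giving one importance value per amino acid.
    `score_fn(snippet) -> float` stands in for either model's positive score."""
    totals = np.zeros(len(sequence))
    counts = np.zeros(len(sequence))
    for w in window_sizes:
        for start in range(0, len(sequence) - w + 1, stride):
            s = score_fn(sequence[start:start + w])
            totals[start:start + w] += s
            counts[start:start + w] += 1
    return totals / np.maximum(counts, 1)

# toy example with a stand-in scoring function (hypothetical)
toy_score = lambda snippet: snippet.count("K") / len(snippet)
profile = residue_importance("MKKLLSV" * 40, toy_score)
print(profile[:5])
```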
Something else which we might take a look at is a Principal Component Analysis (PCA) of the [CLS] token output of the BERT model. In Fig. 4.11 we see the distributions of positive and negative proteins along the two most significant principal components. The left column shows the cases where the model was correct, whilst the right shows the cases it misclassified. The top row shows the classification as done by the BERT model and the bottom row shows the same for the HMM model. For the BERT model, we see what we expect: it has managed to separate the two distributions quite well and inserted a decision boundary between them.

Figure 4.11: PCA plots of the [CLS] token output of the BERT model. The left column shows the two distributions (red for positive and blue for negative proteins) when the model is correct, whilst the right column shows when the model is incorrect. The top row shows the BERT model's classification and the bottom row shows the same for the HMM model. All plots are made in the same linear scale.

We can see the same results projected onto each axis in Fig. 4.12, for a clearer picture of where the distributions overlap. The main purpose of studying these figures is to ensure that the AI model has learned something different from what the HMM model knows. We see this to be the case, as the two models clearly differentiate between the two classes differently. This allows us to rest easy knowing that our choice of training the BERT model upon HMM labeled data has not merely created an AI model that models the HMM process, but rather a model that has learned something general about the data which is not captured by the HMM model.

Figure 4.12: The corresponding 1D distributions to the 2D distributions in Fig. 4.11.

4.6 Testing on Metagenomic Reads

Finally, we can also test our model in its intended use case: searching through metagenomic reads. For this reason we have compiled a set of samples, one set from the human gut and one from wastewater treatment plants in various countries. The results of running the samples through our models can be seen in Fig. 4.13 and Fig. 4.14. Since these reads are unlabeled, we can say little regarding the performance of each model. In these figures the vast majority of reads fall below the HMM cutoff threshold, so in order to make the figures readable we elected not to show these points. The breakdown of where the reads fall categorically is quite consistent between the samples and is as follows: roughly 93.5% of all reads fall beneath both models' cutoff thresholds (lower left quadrant), roughly 6.5% fall in the lower right quadrant, and the remaining few reads are distributed roughly equally between the two quadrants shown in the figures. Each figure shown represents a total sample size of 10 million reads, but only a minuscule fraction of these fall within the quadrant of interest (the upper right one). Within each figure there exist on the order of 100 points or fewer in the upper right quadrant, and these reads are the ones most likely to be antibiotic resistant.
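The quadrant breakdown reported above amounts to thresholding each read with both models' cutoffs, which can be sketched as follows; the score arrays and cutoff values below are stand-ins for the real model outputs.

```python
import numpy as np

def quadrant_counts(bert_scores, hmm_scores, bert_cut, hmm_cut):
    """Count reads in each quadrant of the BERT/HMM score plane (cf. Fig. 4.13)."""
    b = np.asarray(bert_scores) >= bert_cut
    h = np.asarray(hmm_scores) >= hmm_cut
    return {"both below": int(np.sum(~b & ~h)), "BERT only": int(np.sum(b & ~h)),
            "HMM only": int(np.sum(~b & h)), "both above": int(np.sum(b & h))}

# hypothetical scores; real samples contain roughly ten million reads each
rng = np.random.default_rng(2)
print(quadrant_counts(rng.random(10_000), rng.random(10_000) * 0.3, 0.5, 0.2))
```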
Figure 4.13: Model output on unlabelled metagenomic reads. The insets show the sequence length distributions. The dashed lines show the cutoff thresholds obtained previously. The samples shown are three different metagenomic samples from the human gut. The vast majority of reads scored below the HMM cutoff; these are not included in the figure.

In order to fully determine the performance of our model we would need to synthesize these proteins and test them in a wet lab, but this is too time-consuming and would not fit within the time frame of this thesis work. One thing to note is that although the vast majority of points can be cast aside using only the BERT model, one is still left with roughly 6.5% of the points if we use the optimal cutoff threshold from the validation data. This number is too large to be useful, so we arrive at two possible ways to apply the model. The first is to use the HMM model as a pre-screening step. Since the HMM model is a couple of orders of magnitude faster than the BERT model, this is not too cumbersome. The only drawback of this strategy is that it would only allow us to further narrow down the output of the HMM model, and we know from the validation data that there exist positive proteins which the HMM model fails to capture but the BERT model can detect. The other option is to simply use a much higher cutoff threshold for the BERT model. The figures provide some evidence that this could work well, as each of them contains a band of points with nearly perfect BERT scores.

Figure 4.14: More metagenomic reads, this time from wastewater treatment plants in various countries: a) India, b) Senegal, and c) Sweden.

4.7 Discussion and Conclusion

From the various results presented, we can make a couple of statements. The most glaring aspect in which the BERT model outperformed the HMM model is the matter of sequence length. Depending on which of the metrics shown in the results one goes by, the exact cutoff at which the two models equalize in performance varies. The highest such value was found when comparing the area under the ROC in Fig. 4.1. Here the BERT model outshone the HMM model for sequence lengths of 50 and shorter, and the most striking difference was found at a sequence length of 10, where the BERT model was as good as the HMM model was at a sequence length of 25. The smallest difference between the BERT and HMM models was found when comparing the area overlap between the positive and negative distributions in each model's predictive space, as seen in Fig. 4.5. Here the BERT model outperformed the HMM model for sequence lengths of 25 and shorter, but it also held a small, most likely insignificant, lead up to sequence lengths of 50. From these and the other comparisons we find that the BERT model is capable of correctly classifying most short sequences, whilst the HMM model gives essentially random predictions for them. The sketch below illustrates how such per-length comparisons can be computed.
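As a hedged illustration of this kind of comparison, the following sketch bins validation sequences by length and computes the area under the ROC for each model in each bin; labels, lengths, bert_scores, and hmm_scores are assumed arrays, and the bin edges are illustrative rather than those used for Fig. 4.1.

    # Sketch of a per-length AUC comparison (assumed inputs, illustrative bins).
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def auc_by_length(labels, scores, lengths, bin_starts=range(10, 70, 10), bin_width=10):
        out = {}
        for lo in bin_starts:
            hi = lo + bin_width
            mask = (lengths >= lo) & (lengths < hi)
            # Need sequences of both classes in the bin to compute an AUC.
            if mask.sum() and len(np.unique(labels[mask])) == 2:
                out[(lo, hi)] = roc_auc_score(labels[mask], scores[mask])
        return out

    # bert_auc = auc_by_length(labels, bert_scores, lengths)
    # hmm_auc  = auc_by_length(labels, hmm_scores, lengths)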
Another important aspect is the stability of each method, by which we mean how sensitive each model is to the selection of its cutoff threshold. We saw the effect of this in the ROC and metric plots of Fig. 4.1 and Fig. 4.2. In these figures the HMM method seemingly performed worse for sequence lengths above 45, and this is because its optimal cutoff threshold is quite unstable and varies with sequence length. Thus, once an optimal threshold is selected by including all sequence lengths in the computation, the model misclassifies the longer sequences, which should be the easier ones to classify. We do not see this behaviour with the BERT model, meaning it has a more stable optimal cutoff that does not vary too much with sequence length.

Unfortunately, we also have to pay something for this increased performance, and in this case that payment is computation time. Of course, we are comparing apples to oranges, but for our supervisor's use case the HMM model is one to two orders of magnitude faster than the BERT model. However, the computation time of the BERT model was deemed acceptable, as it can process 10 million sequences in roughly 30 minutes.

Looking to the future of this project, we found that our model could be made more general by training it on multiple protein superfamilies. This could be the next step towards a general antibiotic resistance BERT model. Whether or not this is done, the model should be put to the final test by synthesizing the proteins detected in the metagenomic reads and testing them in a wet lab.

Going back to our stated goals, we have shown that the BERT model is indeed capable of identifying antibiotic resistance in microbial DNA. It also outperforms the HMM model, especially on the short reads which we tend to get from metagenomic sequencing. Further, we have seen that the BERT model has learned something different from the HMM model, which tells us that we can gain even better results by running both models and combining their results. This is not an issue in terms of computation time, as the BERT model is already the bottleneck in that regard. Thus, we recommend that the trained BERT model be used in combination with the HMM model in future work.

Bibliography

[1] Debstar, "Spombe Pop2p protein structure rainbow," 2015. [Online]. Available: https://commons.wikimedia.org/wiki/File:Spombe_Pop2p_protein_structure_rainbow.png (accessed on 2022-05-12)

[2] World Health Organization, "Antibiotic resistance," 2022. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/antibiotic-resistance (accessed on 2022-05-12)

[3] "Antibiotic Resistance Threats in the United States," Centers for Disease Control and Prevention, USA, 2019. [Online]. Available: https://www.cdc.gov/drugresistance/biggest-threats.html

[4] Antimicrobial Resistance Collaborators, "Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis," The Lancet, vol. 399, issue 10325, pp. 629-655, Feb. 2022, doi:10.1016/S0140-6736(21)02724-0

[5] World Health Organization, "The top 10 causes of death," 2020. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/the-top-10-causes-of-death (accessed on 2022-05-12)

[6] J. Jumper, R. Evans, A. Pritzel, et al., "Highly accurate protein structure prediction with AlphaFold," Nature, vol. 596, pp. 583-589, 2021, doi:10.1038/s41586-021-03819-2
[7] "How Bacteria and Fungi Fight Back Against Antibiotics," Centers for Disease Control and Prevention, USA, 2021. [Online]. Available: https://www.cdc.gov/drugresistance/about/how-resistance-happens.html

[8] T. Shafee, "Enzyme Structure," 2015. [Online]. Available: https://commons.wikimedia.org/wiki/File:Enzyme_structure.svg (accessed on 2022-05-12)

[9] N. Stoler and A. Nekrutenko, "Sequencing error profiles of Illumina sequencing instruments," NAR Genomics and Bioinformatics, vol. 3, issue 1, March 2021, doi:10.1093/nargab/lqab019

[10] A. Vaswani, et al., "Attention Is All You Need," 2017, doi:10.48550/arXiv.1706.03762

[11] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," 2018, doi:10.48550/arXiv.1810.04805

[12] R. Edgar, "UCLUST algorithm," usearch v11. [Online]. Available: https://drive5.com/usearch/manual/uclust_algo.html (accessed on 2022-05-12)

[13] InterPro, "Beta-lactamase, class-A active site." [Online].
Available: https://www.ebi.ac.uk/interpro/entry/InterPro/IPR023650/structure/PDB/#table (accessed on 2022-05-12)