Taxonomic Classification of Bacteria in Shotgun Metagenomic Samples
Evaluation of Species Level Classification in the Context of Pathogen Screening for Healthcare Applications

Master's thesis in Engineering Mathematics and Computational Science

IRIS GOLD RODAL

Department of Mathematical Sciences
Division of Mathematical Statistics
Chalmers University of Technology
SE-412 96 Gothenburg, Sweden
Telephone +46 31 772 1000
www.chalmers.se

Gothenburg, Sweden 2023

© IRIS GOLD RODAL, 2023.

Supervisor: Oscar Aspelin, Bioinformatician at 1928 Diagnostics
Examiner: Erik Kristiansson, Professor in Biostatistics and Bioinformatics at Chalmers University of Technology

Master's Thesis 2023

Abstract

The aim was to investigate taxonomic classification and removal of human host DNA in the context of a bioinformatic analysis pipeline for screening of pathogens. The examination was carried out using simulated short-read sequenced shotgun metagenomic samples. It was found that a majority of the DNA of human origin could be separated from bacteria using the k-mer based read classifier Kraken 2 with a custom-built human-only reference database.
Effects on taxonomic classification performance were surveyed for variations in sample composition, parameter settings of the taxonomic classifier and reference database composition. Maintaining both high precision and high recall for species-level taxonomic classification of metagenomic samples was challenging under limited computational resources. A one-size-fits-all approach to taxonomic classification of any shotgun metagenomic sample would be near impossible with the tested k-mer based classifiers (Kraken 2 and Bracken); instead, specialized pipeline tracks optimized for different expected ranges of species, sequencing depths and abundance distributions could be a solution.

Keywords: metagenome taxonomic classification, shotgun metagenomics, WGS, Kraken 2, Bracken, taxonomic classifier, host removal.

Acknowledgements

This endeavor would not have been possible without the collaboration with 1928 Diagnostics, who provided me with the opportunity to carry out my thesis project at the company, and I am thankful for the resources they provided. First and foremost I would like to express my gratitude to my supervisor Oscar Aspelin for sharing expert bioinformatics knowledge and for his continued guidance and encouragement throughout the year. I am further thankful for the support from the development team at 1928 Diagnostics, who, in addition to helpful coding tips and tricks, gave me much appreciated insight into how problems similar to the one in this thesis are tackled in real-world applications. I would also like to direct many thanks to my examiner Erik Kristiansson for taking on this project and to my opponent Jewahri Idris Yousuf, especially for thoughtful feedback and suggestions on the report.

Iris Gold Rodal, Gothenburg, October 2023

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Aim and Scope
    1.1.1 Taxonomic Classification
    1.1.2 Human DNA Filtering

2 Theory
  2.1 Metagenomic Sequencing
  2.2 Plasmids
  2.3 Genomic Reference Sequences and Taxonomy
  2.4 Kraken 2 - Taxonomic Classifier
  2.5 Bracken - Abundance Reestimation

3 Methods
  3.1 Databases
    3.1.1 Human Reference
    3.1.2 Taxonomic Classification - General Screening
    3.1.3 Taxonomic Classification - Pathogen Subset Screening
  3.2 Simulated Metagenomic Datasets
    3.2.1 Many Species Dataset - ds-ln-high
    3.2.2 Fewer Species Dataset - ds-ln-low
    3.2.3 Human Spike Dataset - ds-human
    3.2.4 Pathogen Subset Dataset - ds-un-low
    3.2.5 Parameter Optimization Dataset - ds-param-opt
  3.3 Classification Metrics
  3.4 Experimental Runs

4 Results and Discussion
  4.1 Human DNA Filtering
  4.2 Classifier, Parameters and Thresholds
    4.2.1 Kraken 2 and Bracken Parameters
    4.2.2 Post Classification Filtering
  4.3 Database Content
    4.3.1 Bacteria Only Databases
    4.3.2 Plasmid Depleted Databases
    4.3.3 Database for Pathogen Subset
  4.4 Metagenomic Sample Composition
    4.4.1 The Fewer Species Dataset ds-ln-low
    4.4.2 The Pathogen Subset Dataset ds-un-low
    4.4.3 The Many Species Dataset ds-ln-high
    4.4.4 Reference Genome Considerations

5 Conclusion

Bibliography

A Article Dataset
B Additional Results

List of Figures

1.1 Main segments of the in silico pipeline for taxonomic classification and identification of antimicrobial resistance of shotgun metagenomic samples from the raw sequence files (FASTQ format). The samples are pre-processed with quality control of the reads in the sequence files; reads of human origin are thereafter removed in a filtering step before the sample is passed to taxonomic classification, where the species in the sample are identified. The non-human reads in the sample are assembled, which is where the mixed sequence fragments are sorted and reassembled into the original coherent genome sequences for each organism. Using the genome assemblies, genes associated with antimicrobial resistance can be found. The focus of this thesis is the steps in red, which are filtering of human host DNA and taxonomic classification.

2.1 One of several possible metagenomic processes where DNA is extracted from a biological sample and converted to a digital format by a sequencer. The sequenced DNA data can then be processed to extract information such as species of origin or genes.

2.2 Sketches of bacteria with plasmids. (a) The plasmid is the smaller circular DNA element separate from the bacterial chromosome. (b) Bacteria may have zero, one or multiple plasmids of different types, varying both intra- and inter-species.
(c) Plasmid conjugation between two bacteria, where the plasmid in Bacterium 1 is transcribed and then passed through a pilus, a temporary connecting 'tube', to Bacterium 2.

2.3 (a) Nodes in a taxonomic tree where the vertical levels represent ranks, together with an example of a ranked lineage for the bacterial species Staphylococcus aureus, rooted at the rank of domain and branching out to the rank of strain. The taxonomic classification is limited to ranks of species and above. (b) The taxonomic classifier Kraken 2 assigns sequence reads to a node in the taxonomic tree. A read assigned to a lower rank will by default also be assigned to all the directly ascending nodes in the tree; however, the opposite is not true, as reads can get 'stranded' at higher taxonomic ranks if the classification does not reach the Kraken 2 confidence threshold for lower ranks. Note that the figure is a simplified sketch and would in reality contain all the ranks and nodes in the Kraken 2 reference databases used for the taxonomic classification. (c) Bracken is applied on sample level after read assignment by a Kraken-type classifier, and it redistributes stranded reads from higher to lower taxonomic nodes. Note that the threshold for the minimum number of prerequisite reads a node must have for Bracken to assign extra reads is here set at ten, and as such no stranded reads at the Staphylococcaceae node are redistributed to the Macrococcus node. (d) A redistribution of reads by Bracken results in a re-estimation of the sample abundance percentages, independent across each taxonomic rank, which are displayed in parentheses.

3.1 Pipeline scaffold for human host filtering, taxonomic classification and antibiotic resistance gene identification.
4.1 Average percent of reads filtered out as human in the iss-human dataset, where (a) each sample, independent of size, has a theoretical composition of 10 % simulated human reads and 90 % bacterial reads. The outer doughnut in (b) represents the average proportions of reads that are filtered out by k2-human and classified or unclassified by k2-b-standard. The inner doughnut represents the reads that do or do not align to the human genome using BWA-mem 2, in relation to the outer doughnut. It can be seen that the human-aligned reads are concentrated among the Kraken 2 human-classified reads.

4.2 Average species-level classification metrics for samples in the dataset ds-param-opt, re-run for Kraken 2 confidence parameter settings between 0.0 and 0.2 on the x-axis and a Bracken read threshold between 10 and 1250 on the y-axis. The reference database used was db-standard. The highest F1.5 score was observed for a Kraken 2 confidence of 0.075 in combination with a Bracken minimum read threshold of 750 reads.

4.3 Mean [25th percentile, 75th percentile] classification metrics for Kraken 2 confidence and Bracken minimum-number-of-reads threshold parameter settings of default (conf. = 0.00, min reads = 50), optimized (conf. = 0.075, min reads = 750) and light (conf. = 0.05, min reads = 10) across datasets ds-ln-high, ds-ln-low and ds-un-low, run with the db-standard reference database. (a) False discovery rates (FDR), (b) true positive rates (TPR), (c) F1.5 scores and (d) F1 scores.

4.4 Average (a) false discovery rates (FDR), (b) true positive rates (TPR) and (c) F1.5 scores for the datasets ds-ln-high, ds-ln-low and ds-un-low over read classification abundance filtering for each species. Note that the x-axes are logarithmic.
4.5 Mean [25th percentile, 75th percentile] (a) false discovery rates (FDR), (b) true positive rates (TPR) and (c) F1.5 scores for the databases db-standard, db-bac-comp, db-bac-comp-np, db-bac-rep and db-bac-rep-np across datasets ds-ln-high, ds-ln-low and ds-un-low, when classified with default Kraken 2 and Bracken.

B.1 Average (a) false discovery rates (FDR) Kraken 2 read cut-off, (b) false discovery rates (FDR) Bracken read cut-off, (c) true positive rates (TPR) Kraken 2 read cut-off, (d) true positive rates (TPR) Bracken read cut-off, (e) F1.5 scores Kraken 2 read cut-off and (f) F1.5 scores Bracken read cut-off, for the datasets ds-ln-high, ds-ln-low and ds-un-low over read classification abundance filtering for each species. The classification reference database used for (a)-(f) is the baseline db-standard. Note that the x-axes are logarithmic.

List of Tables

3.1 Databases for human DNA filtering, taxonomic classification of general bacteria and taxonomic classification of a subset of pathogenic bacteria. In the database names, Kraken 2 is abbreviated k2 and BWA-mem 2 bwa, indicating for which program each database was constructed; the remaining databases are combined Kraken 2 and Bracken references.

3.2 Overview of simulated datasets used.

3.3 The iss-ln-rand dataset. Log-normal distribution of community fraction, taking into account that species have different genome sizes. The species pool drawn from is represented in the reference databases, while the specific genome assembly specimens for those species are strictly different from the genome assemblies used for building the reference databases. Simulated paired-end Illumina reads using InSilicoSeq with the NovaSeq model. The samples have separate files for forward and reverse strands.
Note that the listed read count per sample is for one strand.

3.4 The iss-path-subset dataset contains 44 samples with InSilicoSeq-simulated paired-end Illumina reads. One sample in group control-g4 was assigned the wrong content when simulating and was therefore excluded. Five replicates per subgroup, ten species in each sample, with reads drawn for a uniform average genome coverage of 20X for all species in the samples. Note that the listed read count per sample is for one strand.

3.5 Composition of the ds-param-opt dataset.

4.1 For each sample group (5 replicates in each group) of the ds-human dataset, the average [min, max] percentages of reads classified by Kraken 2 (K2) and BWA-mem 2 (BWA). The theoretical composition of the samples is 10 % human and 90 % bacteria.

4.2 Species-level taxonomic classification metrics F1.5 score, true positive rate (TPR) and false discovery rate (FDR) (dataset sample mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

4.3 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the dataset ds-ln-low, run for the db-standard database with Kraken 2 confidence = 0.075 and Bracken read threshold = 750. The values listed are means of the five replicates in each dataset group together with the minimum and maximum [min, max] of the sample group.

4.4 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the dataset ds-un-low, run for the db-standard database with Kraken 2 confidence = 0.075 and Bracken read threshold = 750.
The values listed are means of the five replicates in each dataset group together with the minimum and maximum [min, max] of the sample group. Note that the group names are in reference to the pathogen subset databases, and all bacteria in db-standard were in this case considered for prediction.

A.1 Dataset from the article Evaluation of the Microba Community Profiler for Taxonomic Profiling of Metagenomic Datasets From the Human Gut Microbiome by Parks et al. [1], adapted into the ds-ln-high dataset. Table recreated from Table 2 in the article.

B.1 Species-level taxonomic classification metrics F1, true positive rate (TPR) and false discovery rate (FDR) (dataset mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

B.2 Genus-level taxonomic classification metrics F1.5 and F1 (dataset mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

B.3 Genus-level taxonomic classification metrics true positive rate (TPR) and false discovery rate (FDR) (dataset mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

B.4 Species-level taxonomic classification of the ds-un-low dataset where only the subset of 7 species were considered. The average [min, max] numbers of true positives (TP), false positives (FP) and false negatives (FN) for the five replicates in each dataset group are listed together with the theoretically expected numbers of positives (P) and negatives (N).
The dataset was run for the db-standard database with Kraken 2 confidence = 0.075 and Bracken read threshold = 750, and for the db-bac-subset database with Kraken 2 confidence = 0.3 and Bracken read threshold = 10,000.

B.5 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the bacteria in the dataset ds-ln-high, run for the db-standard database with Kraken 2 confidence = 0.05 and Bracken read threshold = 50. The values listed are means of the 10 replicates in each dataset group together with the minimum and maximum [min, max] of the sample group.

B.6 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the bacteria in the dataset ds-ln-high, run for the db-bac-rep database with Kraken 2 confidence = 0.05 and Bracken read threshold = 50. The values listed are means of the 10 replicates in each dataset group together with the minimum and maximum [min, max] of the sample group.

1 Introduction

Resistance to antibiotics among pathogenic bacteria is listed by the World Health Organization as one of the largest threats to global health [2]. When antibiotics stop working, previously easily treatable infections can become life-threatening to individuals and a dire problem for healthcare systems. Infections can afflict anyone; however, people who are immunocompromised, such as premature babies and organ transplant recipients, are particularly at risk and in need of quick diagnostic methods [3][4]. An important part of tracking and preventing infectious disease outbreaks in hospitals is identification of both the bacterial species and the specific genetic components that can cause antibiotic resistance [5].
There are several different types of methods for identification of species, such as those based on direct laboratory tests or DNA sequencing. An emerging DNA sequencing based technology is shotgun metagenomic sequencing [6]. It is generally faster than single-isolate sequencing, where a time-consuming cell cultivation step is needed, and it captures more information (such as resistance information) than the amplicon metagenomic sequencing method does. In shotgun metagenomics, all the DNA in the sample is sequenced directly, without prior cultivation and separation of species [7]. The drawback of the method, however, is that the bioinformatic analysis can become complicated by a mixture of sequences from different species at different abundances, as well as by contaminants such as DNA from human hosts.

1.1 Aim and Scope

The context of this work is the need for fast and comprehensive methods for screening for pathogens and associated antimicrobial resistance (AMR) in healthcare settings. A broader goal is to develop an automated bioinformatic pipeline for analysis of metagenomic samples, for example taken from patients' bodily fluids, skin swabs or hospital facilities. The main purpose of the pipeline is to take raw whole-genome shotgun metagenomic sequencing data as input, perform analyses and output a report with identified organisms and markers for AMR. The analysis steps in a minimal proposed prototype can be divided into pre-processing of the sample data, removal of human host DNA, taxonomic classification, genome assembly and identification of antimicrobial resistance markers, shown schematically in Figure 1.1. The scope of the project is, however, limited to implementing and evaluating two parts of the prototype shotgun metagenomics pipeline: the taxonomic classification and the filtering of human DNA. A more in-depth description of the aims and scope pertaining to the selected parts is given in Sections 1.1.1 and 1.1.2 below.
Bioinformatic analyses can be inaccessible and expensive, especially for large whole-genome sequenced metagenomic samples, as methods tend to be computationally intensive and require high memory usage [8]. To make the findings more applicable, the challenge is set for the pipeline to be adapted for running on a laptop (500 GB disc, 16 GB RAM and 8 CPUs). Furthermore, to achieve the functionalities, building blocks of open software programs and algorithms are used; there is thus no intention of constructing a fully original pipeline.

[Figure 1.1 schematic: Seq. file → Pre-process → Human filter → Tax. class. → Genome assembly → AMR ID → Result report]

Figure 1.1: Main segments of the in silico pipeline for taxonomic classification and identification of antimicrobial resistance of shotgun metagenomic samples from the raw sequence files (FASTQ format). The samples are pre-processed with quality control of the reads in the sequence files; reads of human origin are thereafter removed in a filtering step before the sample is passed to taxonomic classification, where the species in the sample are identified. The non-human reads in the sample are assembled, which is where the mixed sequence fragments are sorted and reassembled into the original coherent genome sequences for each organism. Using the genome assemblies, genes associated with antimicrobial resistance can be found. The focus of this thesis is the steps in red, which are filtering of human host DNA and taxonomic classification.

1.1.1 Taxonomic Classification

Knowing which microorganisms, especially pathogens, are present in a sample and which are not can be central to treatment of infections or other medical interventions [9]. The function of a taxonomic classification step is to identify and name the organisms in the metagenomic sample based on unique properties of their genome sequences.
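How well a classifier performs this task is quantified later in the thesis with metrics such as the true positive rate (TPR), false discovery rate (FDR) and F-beta scores like F1.5. As a sketch of how such species-level metrics follow from the sets of predicted and true species (function and species names here are illustrative only, not from the thesis code):

```python
def classification_metrics(predicted, truth, beta=1.5):
    """Species-level TPR, FDR and F-beta from predicted vs. true species sets."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # species correctly reported
    fp = len(predicted - truth)   # false discoveries
    fn = len(truth - predicted)   # species missed
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall
    fdr = fp / (tp + fp) if tp + fp else 0.0   # 1 - precision
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    fbeta = (1 + b2) * tp / denom if denom else 0.0
    return tpr, fdr, fbeta

# Two of three true species found, plus one false discovery.
tpr, fdr, f15 = classification_metrics(
    {"S. aureus", "E. coli", "K. pneumoniae"},
    {"S. aureus", "E. coli", "P. aeruginosa"},
)
```

With beta = 1.5 the F-beta score weights recall somewhat higher than precision, matching the screening setting where a missed pathogen is costlier than a false discovery.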
An advantage of shotgun metagenomics is the possibility of capturing genome sequences of pathogens from all domains of life at the same time; however, due to project limitations, the investigation into the problem of taxonomic classification is constrained to bacteria. Furthermore, the lowest taxonomic rank the bacteria will be differentiated at is species, and the classification is limited to previously characterized species. Ideal taxonomic classification, within the scope of the project, is thus defined as correctly classifying all non-novel bacteria in a shotgun metagenomic sample down to the species rank with zero false discoveries.

There are multiple factors involved in determining the difficulty level and success of taxonomic classification. These include, but are not limited to, sample content, taxonomic classifier and reference database. The aim is to describe the individual and combined effects of selected variables from each category and to optimize for classification performance.

The main aspects considered regarding sample composition are how the classification is affected by the number of species in the sample, the similarity between species and the sequencing depth of the organisms in the sample. The scope is limited to short-read samples sequenced using Illumina technologies, which is one of the commonly used sequencing methods for whole-genome shotgun metagenomics [10].

The examination of taxonomic classifiers is focused on the k-mer based DNA-to-DNA classification program Kraken 2 [11] and its companion program Bracken [12]. Kraken 2 has consistently been among the top performers in regards to computational time and classification recall, however lower on precision [8][13][1]. Bracken is a tool for Bayesian re-estimation of abundance after classification with Kraken and has been shown to enhance the abundance estimates of the taxonomic classifications [12].
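The core idea behind this k-mer based classification — look up each k-mer of a read in a taxon-indexed database and only accept a call supported by a sufficient fraction of the read's k-mers (the confidence threshold) — can be illustrated with a deliberately simplified sketch. Real Kraken 2 uses minimizers, a compact probabilistic hash table and lowest-common-ancestor resolution on the full taxonomic tree, none of which is modelled here:

```python
def build_kmer_index(references, k=5):
    """Map every k-mer of each reference sequence to its taxon label.
    K-mers shared by several taxa are marked ambiguous (None), a crude
    stand-in for Kraken 2's lowest-common-ancestor resolution."""
    index = {}
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in index and index[kmer] != taxon:
                index[kmer] = None  # ambiguous between taxa
            else:
                index[kmer] = taxon
    return index

def classify_read(read, index, k=5, confidence=0.1):
    """Assign the read to the taxon with the most k-mer hits, but only
    if that taxon supports at least `confidence` of the read's k-mers."""
    counts = {}
    total = len(read) - k + 1
    for i in range(total):
        taxon = index.get(read[i:i + k])
        if taxon is not None:
            counts[taxon] = counts.get(taxon, 0) + 1
    if not counts:
        return None  # unclassified
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= confidence else None

# Toy references; real databases index whole genomes.
idx = build_kmer_index({"taxA": "ACGTACGTACGT", "taxB": "TTTTTGGGGGCCCCC"})
```

Raising the `confidence` parameter trades recall for precision, which is exactly the behaviour explored for Kraken 2's confidence setting in Chapter 4.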
The objective is to investigate if parameter tuning of Kraken 2 and the accompanying Bracken can improve classification performance.

Kraken 2 uses an indexed database of genetic reference sequences and, for fast classification, needs enough free memory to hold the database in memory, or maps it from disc as a slower alternative [11]. As memory space is limited, there will for a fixed file size be a trade-off between the amount of representation each species has and the number of species in the database. The aim is to survey if custom-built reference databases can increase classification performance compared to a pre-built standard, within the size constraints of being able to run on a laptop.

1.1.2 Human DNA Filtering

Collateral human DNA is expected to be included when doing shotgun metagenomic sampling directly from a patient, for example of blood or skin and oral swabs [14]. Mixed human-origin DNA is also likely to appear in varying amounts when sampling from an environment such as hospitals [15]. In the context of a bioinformatic pipeline intended for clinical application, it is therefore likely for incoming samples to contain substantial amounts of human DNA if it is not removed in the laboratory. The content of the human DNA is not of interest for taxonomic classification of bacteria or identification of antibiotic resistance; however, storing and processing it would consume unnecessary resources as well as have implications for privacy, as it can be linked to individuals.

The bottleneck for computational resources is the genome assembly step. Genome assembly, where the mixed sequence fragments in the sample are sorted and reassembled into the original coherent genome sequences for each organism, is needed for the identification of antibiotic resistance genes. Depending on the program used for assembly, the time and resource usage can grow fast when the sequence complexity grows, such as when the number of species and the genome sizes increase [16][17].
Filtering out human reads from the sample before assembly could reduce the complexity, as the human genome is large compared to those of microorganisms. It is therefore relevant to investigate methods for filtering of human reads as a pipeline step when working with limited memory and computational resources. Before sequencing the metagenomic sample, physical human DNA fragments can be removed or degraded using commercial human host depletion kits [14][18]. However, the processes of human DNA elimination can have biased effects on the recovery of microbial DNA for certain species due to their characteristics [14]. Filtering of human DNA reads in silico, post sequencing, can be used as a complement or a preferred alternative to the depletion kits, depending on the percentage of human DNA in the sample and the intended analysis [14]. There are established workflows for separating human reads from the microbial reads in the sample FASTQ file based on direct alignment of reads to a human reference genome [19]. The k-mer based taxonomic read classifier Kraken 2 has been shown to be comparably fast at classifying microorganisms in metagenomic samples [11]. It is proposed here to take advantage of the speed of Kraken 2 for the extrapolated task of human read classification and removal.

The aim pertaining to a pipeline step for filtering out human DNA is to investigate if Kraken 2 can be a faster alternative to alignment based methods while remaining competitive in quality. Specifically, the questions to answer are: Can human-origin reads be filtered out from an Illumina short-read shotgun metagenomic sample using the k-mer based taxonomic classifier Kraken 2? Is filtering by Kraken 2 comparable to a method using the Burrows-Wheeler aligner BWA-mem 2?

2 Theory

The intent of the chapter is to provide a theoretical background on the methodology behind metagenomic sequencing data, reference databases and taxonomic classification programs.
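The read-level human filtering proposed in Section 1.1.2 amounts to partitioning a FASTQ file on the classifier's per-read verdicts. As a sketch, assuming Kraken 2's tab-separated per-read output (a C/U classified flag, read ID and taxonomy ID in the first three columns) and the NCBI taxonomy ID 9606 for Homo sapiens; in a real pipeline Kraken 2 can instead write the non-human reads directly via its --unclassified-out option when run against a human-only database:

```python
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens

def human_read_ids(kraken_output_lines):
    """Collect IDs of reads Kraken 2 classified as human.
    Assumes the standard per-read output: <C|U> <read id> <taxid> ..."""
    ids = set()
    for line in kraken_output_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2] == HUMAN_TAXID:
            ids.add(fields[1])
    return ids

def filter_fastq(fastq_lines, drop_ids):
    """Yield the 4-line FASTQ records whose read ID is not in drop_ids."""
    for i in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[i:i + 4]
        read_id = record[0][1:].split()[0]  # strip '@', drop description
        if read_id not in drop_ids:
            yield from record

# Hypothetical classifier output: read1 human, read2 some other taxon,
# read3 unclassified. Only read1 should be filtered out.
kraken_out = ["C\tread1\t9606\t150\t...", "C\tread2\t1280\t150\t...",
              "U\tread3\t0\t150\t..."]
fastq = ["@read1 1:N:0", "ACGT", "+", "IIII",
         "@read2 1:N:0", "CCGG", "+", "IIII",
         "@read3 1:N:0", "TTAA", "+", "IIII"]
kept = list(filter_fastq(fastq, human_read_ids(kraken_out)))
```

For paired-end samples the same ID set would be applied to both the forward and reverse FASTQ files so that mates stay synchronized.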
2.1 Metagenomic Sequencing

Metagenomic sampling is used when characterizing microbiomes in both the natural and human environment [20][21]. Metagenomics is also emerging in the context of infectious disease detection for species identification and antimicrobial resistance prediction [9][6]. A metagenomic sample is a biological sample that contains genetic material from the organisms, potentially multiple of different types, present in the sampled environment. When a metagenomic sample is collected, by for example a skin swab or tissue sample, the genetic material, in this case DNA, is first extracted from the cells [22]. To be able to interpret the information carried by the physical DNA strands, they need to be converted into a readable form by a DNA sequencer. The extracted DNA strands can be sequenced either by targeting certain regions using amplicon methodology, or non-selectively in so-called shotgun sequencing, where everything in the sample is sequenced [23].

The output from DNA sequenced with next-generation sequencing technologies is a digital read file, typically in the FASTQ format, where each read of a DNA fragment is a transcription into letters accompanied by a Phred quality score for each DNA nucleotide [24]. The reads can be of differing length depending on the type of sequencer. For example, Illumina short-read sequencers give read lengths of 50-300 bp, and read lengths of 150 bp are recommended by Illumina for whole-genome sequencing [25][26]. Illumina sequencing can generate so-called paired-end reads, where a fragment sequenced in both directions gives rise to two paired reads, and a sequenced sample can then have two FASTQ files, one for the reads of the forward strand and one for the reverse strand [24]. The average sequencing error rates for Illumina sequencers are less than 1 %; however, the errors are non-uniformly distributed across the reads and furthermore depend on the specific Illumina platform used [27].
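A FASTQ record consists of four lines: an @-prefixed identifier, the base calls, a '+' separator and a quality string whose characters encode the Phred scores (Q = ASCII value − 33 in the offset-33 convention used by current Illumina output, with error probability p = 10^(−Q/10)). A minimal decoder, as a sketch:

```python
def parse_fastq_record(lines):
    """Decode one 4-line FASTQ record into (read id, sequence, Phred scores)."""
    header, seq, _plus, qual = lines
    read_id = header[1:].split()[0]        # drop '@' and any description
    scores = [ord(c) - 33 for c in qual]   # Phred+33 encoding
    return read_id, seq, scores

def error_probability(q):
    """Phred score Q -> probability that the base call is wrong."""
    return 10 ** (-q / 10)

# 'I' encodes Q40 (1 error in 10,000), '!' encodes Q0, '5' encodes Q20.
rid, seq, scores = parse_fastq_record(["@read1 desc", "ACGT", "+", "II!5"])
```

Quality-control tools in the pre-processing step of Figure 1.1 apply essentially this decoding when trimming low-quality read ends.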
Information can be gained directly from the reads, such as in some types of taxonomic classification, or the reads can be assembled into the original genomes using the property that reads overlap along the genome [23]. Because the organisms in the metagenomic sample are not separated before extraction and sequencing of the DNA, the reads from all organisms in the sample are mixed together in the FASTQ file, in contrast to sequencing of single-isolate cultures where all reads originate from one type of organism [22]. The relative abundance of cells from a certain species in the physical sample will approximately carry over to the relative abundance of reads from that species in the FASTQ file; however, the read abundance is also influenced by the length of the species' genome. The average read coverage of a species' genome is the average number of reads covering a nucleotide position in the genome. A metagenomic sample with a non-uniform distribution of species abundances will result in the genomes of some species being covered by fewer reads than others, even below an average of 1X depending on the sequencing depth of the sample [28]. To increase the read coverage of the lowest-abundance species, the whole sample must be sequenced further, which extends the sequencing time.

Figure 2.1: One of several possible metagenomic processes where DNA is extracted from a biological sample and converted to a digital format by a sequencer. The sequenced DNA data can then be processed to extract information such as species of origin or genes.

2.2 Plasmids

A majority of a bacterium's genes are located on its chromosome; however, a subset of accessory genes may be found on a plasmid, which is extrachromosomal circular DNA [29], see Figure 2.2 (a). The functionalities of plasmids are diverse, as are their sizes, which range from hundreds of base pairs to over a million [30].
The prevalence and type of plasmids vary between bacterial species but can also differ within a species, as illustrated in Figure 2.2 (b). Similar to the chromosome, plasmids are replicated and passed on to daughter cells in association with cell division; however, they also carry genes for independent transmission and replication [29]. Furthermore, plasmids are mobile genetic elements that can be horizontally transferred, by so-called conjugation, between prokaryotic cells in the same generation, see Figure 2.2 (c). Conjugation of plasmids can occur naturally both between bacterial cells of the same species and between cells of two different species [31]. Plasmids are therefore a common vector for the spread of genes associated with antibiotic resistance properties [29]. Plasmid transfer is however not random, as it may be induced by environmental factors and is furthermore unequally distributed across bacterial species [32]. For example, the F-type plasmids are mainly transferred among species within the Enterobacteriaceae family, which includes pathogenic strains of Escherichia and Salmonella [33].

Figure 2.2: Sketches of bacteria with plasmids. (a) The plasmid is the smaller circular DNA element separate from the bacterial chromosome. (b) Bacteria may have zero, one or multiple plasmids of different types, varying both intra- and inter-species. (c) Plasmid conjugation between two bacteria, where the plasmid in Bacterium 1 is transcribed and then passed through a pilus, a temporary connecting 'tube', to Bacterium 2.

2.3 Genomic Reference Sequences and Taxonomy

The National Center for Biotechnology Information (NCBI) is maintained by the United States government and provides public access to biomedical and genomic information [34]. The NCBI online resources take the form of reference databases, sequence analysis tools and repositories for research studies.
NCBI genomic resources include genomes, mainly of human-associated microorganisms, organized in databases with nucleotide sequences, genome assemblies and mapped annotations. GenBank is an archival database of publicly available genetic sequences. It accepts submissions of whole genome sequences and annotated genome assemblies, in addition to, for example, transcriptome assemblies and targeted locus studies [35]. Until the year 2022, 1,349,781 genome assemblies of cellular organisms were submitted, of which 91 % were from bacteria [36]. RefSeq is another sequence collection distributed by NCBI and can be described as a non-redundant and curated version of GenBank [37][38]. In contrast to GenBank, NCBI owns the records and can update, combine and improve the annotations and sequences [38]. The two databases have similar content but differ in format and in how the sequences are annotated. RefSeq contains genome assembly entries from all domains of life. For entries until and including the year 2022 there are 331,892 assemblies of cellular organisms, of which 41 % are bacterial [36]. Furthermore, of the bacterial assemblies 33,364 are on complete genome level, and when atypical genomes are excluded there are 32,645 assemblies remaining. More studied organisms are overrepresented in the number of assembly entries in RefSeq, and the taxonomic resolution is often higher, with entries categorized on lower taxonomic levels such as sub-species and/or strain level. For example, a subset of seven pathogenic bacteria more closely examined in this thesis, A. baumannii, E. faecium, E. faecalis, K. pneumoniae, P. aeruginosa, S. aureus and S. epidermidis, make up 13.6 % of the complete level bacterial assemblies [36]. Every entry of a genome assembly in RefSeq is given a new unique RefSeq assembly accession, alongside an individual accession for each FASTA sequence in the assembly.
The assembly is tagged with its estimated level of completeness, where the alternatives in increasing order are contig, scaffold, chromosome and complete genome level. For a complete assembly of a bacterial organism, the number of sequences in the FASTA file is expected to equal the combined number of chromosomes and plasmids in the organism in question. The number and type of plasmids may not be consistent between exemplars of the same species, as there can be a physiological variation in plasmid distribution. A subset of the RefSeq assemblies are categorized by NCBI as Representative genomes, also referred to as Reference genomes, based on their sequence annotation quality, recognition as a community standard or medical importance [38][36]. So as not to confuse terminology: any genome can be used as a reference, for example when aligning reads or building an indexed database; however, the RefSeq category refers to assemblies of particular importance, generally one per species. In exceptional cases there may be more than one reference genome per prokaryote species; for example, E. coli has two. The number of bacterial reference genomes listed until 2022 is 16,765, spread across all assembly completeness levels. Another aspect of the RefSeq database entries is their status. A genome assembly may be assigned a suppressed status for reasons that include being replaced by another record to reduce redundancy, or removal by curators due to lack of support [37]. Suppressed records are however still retrievable, but carry a marker.

The taxonomy system used by NCBI is a hierarchical organization of organism relations with the main ranks of superkingdom/domain, phylum, class, order, family, genus and species [39], sketched in Figure 2.3 (a). All NCBI records, such as genome assemblies, are mapped to nodes in the taxonomic tree.
For cellular organisms, the distance between two organisms across ranks in the taxonomic tree is not always consistent with the actual genetic similarity between the organisms, and the taxonomic tree is thus not a phylogenetic tree [40]. To capture the variability, extra ranks may be added, such as species complexes, sub-species and strains. With new research discoveries the taxonomy is continuously updated by adding or merging taxonomic nodes [39]. Furthermore, changes of nomenclature conventions may also trigger updates of the NCBI taxonomy. For example, an inclusion of the phylum rank in the International Code of Nomenclature of Prokaryotes led to the renaming or new inclusion of 41 phyla in NCBI, affecting millions of records in January of 2023 [41]. For evaluation of taxonomic classification it is thus important to correctly align the versions of the taxonomy and the current scientific names used by the classifier and by the metadata of the metagenomic sample.

2.4 Kraken 2 - Taxonomic Classifier

Kraken 2 is a faster and more memory efficient re-written version of the taxonomic classification program Kraken [11]. In Kraken 2, each read in a FASTQ sequence file of a metagenomic sample is classified independently (for short-read sequencing a read is typically 150 nucleotides long). Simplified, it is based on querying read sub-sequences of length K against a specially built Kraken 2 database containing genomic reference sequences associated with nodes in a taxonomic tree, see Figure 2.3 (b). Some K-mers will uniquely map to nodes at the lowest taxonomic levels, while others, by evolutionary relatedness or random chance, are shared by several lower taxonomic nodes and are only unique for a higher taxonomic node, here called a lowest common ancestor.
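The lowest-common-ancestor idea can be sketched with a toy database. The two-species lineage table below is hypothetical, and real Kraken 2 uses minimizer hashing and weighted root-to-leaf scoring rather than the simple majority vote shown here:

```python
from collections import Counter

# Toy lineages, root -> leaf (illustrative labels, not real NCBI records).
LINEAGE = {
    "S. aureus":      ["Bacteria", "Staphylococcus", "S. aureus"],
    "S. epidermidis": ["Bacteria", "Staphylococcus", "S. epidermidis"],
}

def lca(taxa):
    """Lowest common ancestor: the deepest rank shared by all lineages."""
    paths = [LINEAGE[t] for t in taxa]
    ancestor = None
    for level in zip(*paths):
        if len(set(level)) == 1:
            ancestor = level[0]
        else:
            break
    return ancestor

def build_db(genomes, k=4):
    """Map each K-mer to the LCA of every genome containing it."""
    owners = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(taxon)
    return {kmer: lca(taxa) for kmer, taxa in owners.items()}

def classify(read, db, k=4):
    """Assign the read to the taxon hit by the most of its K-mers."""
    hits = Counter(db[read[i:i + k]] for i in range(len(read) - k + 1)
                   if read[i:i + k] in db)
    return hits.most_common(1)[0][0] if hits else "unclassified"

genomes = {"S. aureus": "ACGTACGTTT", "S. epidermidis": "ACGTAAAAGG"}
db = build_db(genomes)
```

In this toy database the K-mer 'ACGT', present in both genomes, maps to the shared genus node, while genome-unique K-mers map to the species nodes.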
For example, if the database was built with only one reference genome, all the K-mers of that genome sequence would be uniquely mapped to its taxonomic node; if another genome sequence was added from a different node, fewer K-mers would uniquely map to the first taxonomic node, and for an infinite number of reference sequences no unique K-mers would be found. In reality, the Kraken 2 database is a probabilistic compact hash table where only minimizers of the K-mers, sub-strings of length L ≤ K, are stored; even so, a database built from a large reference collection can reach sizes on the order of 1 TB [43]. When Kraken 2 is run, by default the entire Kraken 2 database is read into random access memory (RAM), which can be problematic if the reference database is large. If the option --memory-mapping is used, the database is instead read from disk; however, the runtime will increase as a result. For control of the database size, a limit on the final file size can be applied during the build step using the argument --max-db-size [11]. The same reference sequences will be represented as for the full size, but with fewer than the possible unique K-mers per sequence. In general, limiting the database size (for the same genome sequences) reduces the classification precision [43]. An alternative to limiting the database size in the build step is to start with fewer genomic reference sequences, possibly by representing less variation within, for example, each species, or by including genomes from fewer species in total.

(a) Taxonomic tree (b) Kraken 2 read assignment (c) Bracken reestimation 1 (d) Bracken reestimation 2

Figure 2.3: (a) Nodes in a taxonomic tree, where the vertical level represents ranks, together with an example of a ranked lineage for the bacterial species Staphylococcus aureus, rooted at the rank of domain and branching out to the rank of strain. The taxonomic classification is limited to ranks of species and above. (b) The taxonomic classifier Kraken 2 assigns sequence reads to a node in the taxonomic tree.
A read assigned to a lower rank will by default also be assigned to all the directly ascending nodes in the tree; however, the opposite is not true, as reads can get 'stranded' at higher taxonomic ranks if the classification does not reach the Kraken 2 confidence threshold for lower ranks. Note that the figure is a simplified sketch and would in reality contain all the ranks and nodes in the Kraken 2 reference databases used for the taxonomic classification. (c) Bracken is applied on sample level after read assignment by a Kraken type classifier, and it redistributes stranded reads from higher to lower taxonomic nodes. Note that the threshold for the minimum number of reads a node must already have for Bracken to assign extra reads is here set at ten, and as such no stranded reads at the Staphylococcaceae node are redistributed to Macrococcus. (d) A redistribution of reads by Bracken results in a reestimation of the sample abundance percentages, independent across each taxonomic rank, which are displayed in parentheses.

2.5 Bracken - Abundance Reestimation

Bracken is a program for Bayesian reestimation of abundances of metagenomic samples after classification with Kraken, Kraken 2 or KrakenUniq [12][45]. Bracken has been shown to improve the overall classification performance, in addition to the abundance estimation, when used in combination with Kraken type classifiers compared to use of the classifier alone [8][43]. The tool is lightweight, with runtimes of less than a minute [13]. However, to run Bracken a reference database is required, and it needs to be custom built for the specific Kraken, Kraken 2 or KrakenUniq database used for the classification. Because the Kraken type classifiers perform read level classification based on the lowest common ancestor, a proportion of the reads can get stranded at nodes for higher taxonomic ranks, resulting in few reads at lower levels [12].
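The redistribution of stranded reads illustrated in Figure 2.3 (c) can be sketched as a proportional push-down. This is a simplification: Bracken's actual reestimation is Bayesian and weighted by K-mer similarity between genomes, and the function name and counts below are illustrative only:

```python
def redistribute(parent_reads, species_counts, threshold=10):
    """Proportionally push reads stranded at a parent node down to species
    that already have at least `threshold` reads assigned by the classifier.
    Simplified stand-in for Bracken's Bayesian redistribution."""
    eligible = {s: n for s, n in species_counts.items() if n >= threshold}
    total = sum(eligible.values())
    out = dict(species_counts)
    if total == 0:
        return out, parent_reads  # nothing eligible: reads stay stranded
    for s, n in eligible.items():
        out[s] = n + parent_reads * n / total
    return out, 0

# 100 reads stranded at the genus/family node; Macrococcus is below the
# ten-read threshold (as in Figure 2.3 (c)) and receives nothing.
counts, stranded = redistribute(
    100, {"S. aureus": 60, "S. epidermidis": 20, "Macrococcus": 5})
```

With these toy numbers, S. aureus receives 100 × 60/80 = 75 extra reads and S. epidermidis the remaining 25, after which per-rank abundance percentages can be recomputed as in Figure 2.3 (d).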
The problem is accentuated for species in the database with high average nucleotide identity, and can thus result in uncertainty in the prediction of a sample's species level composition. Bracken solves the problem of low classification rates at lower taxonomic ranks by probabilistic redistribution of 'stranded' reads at nodes higher up in the taxonomic tree down towards the species rank (or any other requested taxonomic rank) [12], see Figure 2.3 (c) and (d). For a species node to receive additional reads, it needs to directly descend from the higher taxonomic node and have a minimum number of reads already assigned by the Kraken type classifier [45]. The threshold for the minimum number of reads can be adjusted using the --threshold parameter.

3 Methods

Various reference databases, metagenomic sample datasets and software programs were used for the investigation of human DNA filtering and taxonomic classification. In this chapter, first the content of the databases and datasets is described, together with the methods used to build them. Thereafter, performance metrics for taxonomic classification are defined, and lastly the experimental runs are described in terms of the combinations of databases, datasets and classifiers used. For orientation of where the components fit in the pipeline, see the flowchart in Figure 3.1, where the main steps of a short-read analysis are displayed. A sample enters the pipeline in the form of a .FASTQ file, or two in the case of paired-end reads. The sample is first pre-processed by trimming low quality ends of reads and length filtering. Human origin reads can be removed from the sample by either Kraken 2 K-mer matching or genome alignment with BWA-mem2, using human reference databases. The remaining human-free sample is passed on to taxonomic classification and antibiotic resistance gene identification, of which only the former is further analyzed here.
The non-human reads are taxonomically classified with Kraken 2 using a bacterial reference database, and the output is passed to Bracken for abundance re-estimation using a Kraken 2 complementary reference database. The results are collected and compared to possible sample metadata containing a theoretical sample species composition and known resistance genes. The pipeline output is a report per sample, in addition to dataset summaries if samples are run in batch.

[Flowchart: Start → sample .FASTQ files → pre-processing (quality control, read trimming) → human filtering by either Kraken 2 classification (Kraken 2 GRCh38.p14 database) or BWA-mem 2 alignment (BWA GRCh38.p14 database) → non-human reads → taxonomic classification with Kraken 2 (Kraken 2 bacteria database) → abundance re-estimation with Bracken (Bracken bacteria database) → taxonomic IDs and abundance estimates → result collection and processing with sample metadata → report → End. Parallel track: genome assembly with Megahit → contigs → gene alignment with BLAST against a resistance gene database → resistance genes and locations.]

Figure 3.1: Pipeline scaffold for human host filtering, taxonomic classification and antibiotic resistance gene identification.

3.1 Databases

A collection of reference databases was built for assessing the impact of the database choice on human DNA filtering and the taxonomic classification of bacteria. An overview of the databases and their properties can be found in Table 3.1. Two databases, the Kraken 2 db-k2-human and the BWA-mem 2 db-bwa-human, were used for evaluating human origin DNA filtering. For general bacterial screening, a prebuilt Kraken 2 and Bracken combined baseline database, db-standard, was compared against two pairs of bacteria-only combined Kraken 2 and Bracken databases, where each pair contains one database with plasmid sequences kept and one with plasmid sequences removed: db-bac-comp and db-bac-comp-np, and db-bac-rep and db-bac-rep-np.
In addition, the Kraken 2/Bracken db-bac-subset database, containing only a subset of seven pathogenic bacteria, was built for screening specifically for a selection of bacteria.

Database name    | Content                          | Plasmids? | Genomes | Size cap? | Size (GB)
db-k2-human      | Human                            | -         | 1       | no        | 4.6
db-bwa-human     | Human                            | -         | 1       | no        | 20.1
db-standard [44] | Archaea, bacteria, human, virus  | yes       | -       | yes       | 8.0
db-bac-comp      | Bacteria                         | yes       | 33k     | yes       | 8.6
db-bac-comp-np   | Bacteria                         | no        | 33k     | yes       | 8.6
db-bac-rep       | Bacteria                         | yes       | 17k     | yes       | 8.0
db-bac-rep-np    | Bacteria                         | no        | 17k     | yes       | 8.0
db-bac-subset    | Bacteria                         | yes       | 5k      | no        | 0.4

Table 3.1: Databases for human DNA filtering, taxonomic classification of general bacteria and taxonomic classification of a subset of pathogenic bacteria. In the database names, Kraken 2 is abbreviated k2 and BWA-mem 2 bwa, indicating for which programs those databases were constructed, while the other databases are combined Kraken 2 and Bracken references.

3.1.1 Human Reference

Both K-mer and alignment based methods, used for detection of human reads in a sample, require an indexed database including human reference sequences. The same human genome assembly, GRCh38.p14 [46] from the Genome Reference Consortium, was used as the reference when building the K-mer based Kraken 2 database db-k2-human and the alignment based BWA-mem 2 database db-bwa-human.

The db-k2-human was built from the assembly with default database building parameter settings: a K-mer length of 35 nucleotides, a minimizer length of 31 nucleotides and 7 minimizer spaces. Contrary to default settings, no masking of the human genome regions containing low complexity sequences, such as regions with repeated 'ACACACAC' or 'AAAAAAA', was performed before building the indexes, by using the argument --no-masking. Furthermore, no memory capping was applied to the build, and the resulting database had a size of 4.6 GB.
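The kind of low-complexity regions that masking would otherwise remove can be flagged with a simple base-diversity heuristic. This is illustrative only: real pipelines use dedicated masking tools, and the window size and distinct-base threshold below are arbitrary choices:

```python
def low_complexity(seq, window=16, max_distinct=2):
    """Return one flag per window: True when the window is dominated by very
    few distinct bases, as in 'ACACACAC...' or 'AAAAAAA...' repeats.
    Toy heuristic, not a reimplementation of any real masking tool."""
    return [len(set(seq[i:i + window])) <= max_distinct
            for i in range(len(seq) - window + 1)]
```

A dinucleotide repeat such as 'ACACACACACACACACAC' is flagged in every window, while a mixed 18-base sequence is not; skipping this masking (--no-masking) keeps such repeats in the index, where they can attract spurious K-mer matches.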
For the db-bwa-human database, each chromosome, and the collected mitochondrial DNA, in the human genome assembly was indexed separately by BWA-mem 2 before being combined. The combined database size was 20.1 GB, although as a result of the chromosome split no individual file exceeded 1 GB, and each could therefore be loaded entirely into memory when running on a laptop with 16 GB RAM.

3.1.2 Taxonomic Classification - General Screening

The baseline database was fetched from the Index zone collection of prebuilt Kraken 2 and Bracken combined databases and is here named db-standard (original name Standard-8, version 2022-12-09) [44]. The collection, which is regularly updated, contains databases of various content and sizes, and is maintained by Ben Langmead, who is associated with the Kraken projects [44]. According to Index zone, the Standard-8 database was built using all complete level assemblies for bacteria, archaea, plasmid and virus listed on RefSeq until 2022-12-09. The human genome (GRCh38) was also included in the Standard-8, and its Kraken 2 library was created with the --no-masking argument, which means that low complexity regions of the genome were not masked. The NCBI UniVec_Core library, containing vector, adapter, linker and primer sequences, was also included in the database. The Standard-8 is the 8 GB size-capped version of the Standard 64 GB database and was constructed using the exact same genome assemblies, however with less of each included, which was accomplished with the --max-db-size argument. Lastly, the Kraken 2 complementary Bracken database was added to the Standard-8 database. Apart from the non-masking of low complexity regions of the human genome and the size capping, the Kraken 2 and Bracken databases were built with default settings.

The combined Kraken 2 and Bracken index database db-bac-comp was built using only sequences from the bacterial domain.
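As described later in this section, the plasmid-free '-np' database variants were produced by dropping FASTA records whose header lines mention 'plasmid'. A minimal sketch of such header-based filtering (a hypothetical helper, not the project's actual script, operating on a list of lines rather than streamed files):

```python
def strip_plasmids(fasta_lines):
    """Drop every FASTA record whose '>' header line contains 'plasmid'.
    All sequence lines up to the next header share their header's fate."""
    out, keep = [], True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = "plasmid" not in line.lower()
        if keep:
            out.append(line)
    return out

# Toy assembly with one chromosome record and one plasmid record
# (hypothetical accessions for illustration).
fasta = [">NZ_X chromosome", "ACGT", ">NZ_Y plasmid pA", "GGCC"]
filtered = strip_plasmids(fasta)
```

Note that this kind of filtering relies entirely on the header annotation; unannotated plasmid sequences would pass through.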
The sequences downloaded were all RefSeq bacterial genome assemblies with an assembly level of complete, recorded before 2022-12-31, and with an exclusion of genomes tagged as atypical. Assemblies with a RefSeq status of suppressed were automatically excluded by Kraken 2 during the build step, which amounted to 5.3 % of the total number of downloaded sequences. Default parameter settings were used in the Kraken 2 database build, apart from applying the argument --max-db-size. The size of the hash table was intended to be limited to 8 GB, the same size as the db-standard; however, the wrong gigabyte definition was applied, which resulted in a size of 8.6 GB. A Bracken database was thereafter built as an attachment to the Kraken 2 database using default settings.

A combined Kraken 2 and Bracken bacterial database without plasmids, db-bac-comp-np, was constructed using the same bacterial genome assemblies and parameter settings as db-bac-comp, with the exception that the plasmid sequences were excluded from the sequence library before the database build step. The plasmids were removed by searching for and deleting sequences with the word 'plasmid' in the sequence header for each assembly FASTA file.

The databases db-bac-rep and db-bac-rep-np were built using the same methods as the databases db-bac-comp and db-bac-comp-np respectively, except with a different set of genome sequences. The sequences used were bacterial assemblies in the RefSeq category of Representative/Reference genomes listed up until 2022-12-31. The assembly level of completeness ranged from contig to complete. No records were suppressed out of the 923,148 individual sequences downloaded; however, 8 had unidentified sequence accessions and were therefore excluded from the Kraken 2 database build.

3.1.3 Taxonomic Classification - Pathogen Subset Screening

A reference database, db-bac-subset, for Kraken 2 with a supplementary Bracken database, was built for a subset of 7 bacterial pathogens.
The sequences used consisted of all RefSeq complete level assemblies of the bacteria A. baumannii, E. faecium, E. faecalis, K. pneumoniae, P. aeruginosa, S. aureus and S. epidermidis listed until 2022-12-31. Of the downloaded sequences, 10.2 % were automatically excluded from use in the databases as a result of having an NCBI suppressed status. The db-bac-subset database was built using default settings, including the default of no size capping. Without size limitations, the resulting database was 0.4 GB.

3.2 Simulated Metagenomic Datasets

Read simulation was used to create a set of metagenomic samples with known composition, for which taxonomic classification could be evaluated. Indeed, the exact sample composition, in terms of species and their proportions, was in itself required in order to obtain high resolution classification performance metrics. The datasets of raw read shotgun metagenomic samples used in this project were all generated by simulation from already taxonomically profiled reference genomes. One dataset was adapted from a 2021 article on the evaluation of taxonomic classifiers (Parks et al. [1]) and four were simulated anew in the project using the Illumina short-read simulation program InSilicoSeq (v.1.5.4) [47]. The samples in the InSilicoSeq datasets (ds-ln-low, ds-human, ds-un-low and ds-param-opt) were simulated using genome sequences distinct from those used for building the reference databases; however, there was overlap in source sequences between the databases and the article-adapted dataset (ds-ln-high). An overview of the five datasets, which in total contain 250 samples, can be found in Table 3.2.

Dataset name   | Number of samples | Sequence in databases? | Bacterial species per sample
ds-ln-high [1] | 140               | yes                    | 50-600
ds-ln-low      | 40                | no                     | 10-80
ds-human       | 20                | no                     | 20-40
ds-un-low      | 44                | no                     | 10
ds-param-opt   | 6                 | no                     | 10-80

Table 3.2: Overview of the simulated datasets used.
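The read counts quoted for the datasets below follow from coverage arithmetic: the mean coverage of a genome is the total number of sequenced bases divided by the genome length. A small sketch, reusing figures that appear elsewhere in this thesis (150 bp reads, a 5 Mbp average bacterial genome); the helper names are illustrative:

```python
def mean_coverage(n_reads, read_len, genome_len):
    """Average read coverage: total sequenced bases / genome length."""
    return n_reads * read_len / genome_len

def reads_for_coverage(coverage, genome_len, read_len=150):
    """Invert the relation: reads needed to reach a target mean coverage."""
    return round(coverage * genome_len / read_len)

# 40 reads of 150 bp on a 5 Mbp genome is far below 1X coverage,
# while 20X coverage of the same genome needs roughly 667k reads.
low = mean_coverage(40, 150, 5_000_000)
deep = reads_for_coverage(20, 5_000_000)
```

This is why, at a fixed sequencing depth, low-abundance species in species-rich samples can end up with only a handful of reads.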
3.2.1 Many Species Dataset - ds-ln-high

The ds-ln-high dataset consists of 140 simulated Illumina short-read (150 bp) samples of mock prokaryotic communities, adapted from Parks et al. [1]. The reads were simulated using in part the same RefSeq genome assemblies as were used for the reference database build (db-standard). However, as the average nucleotide identity (ANI) to the reference genomes was modified in the simulation, the authors estimated that using in part the same original genome assemblies for databases and sample construction would not significantly bias the classification results. The dataset was divided into 14 groups of 10 samples with replicate compositions, meaning that the replicates were drawn from the same distributions for factors such as species diversity, number of species and average nucleotide identity to the closest reference genome. The samples included mainly bacteria but also archaea, with the number of species per sample drawn from a normal distribution with either µ = 100, σ = 25 or µ = 500, σ = 100, log-normal species abundance distributions and varying diversity in strains per species. The number of nucleotides simulated per sample (2.1 × 10^9 bp/sample) was independent of the number of species in the sample. Since the samples had a fixed number of reads and the within-sample species abundances were log-normally distributed, some species in samples with a high number of species had relative abundances as low as 1.9 × 10^-6 %. For context, a bacterial species, which on average has a genome size of 5 million bp [48], was on the extreme low side represented in a sample by only 40 reads of 150 bp, which means a coverage ≪ 1X. For a display of all sample group variations, see Table A.1 in Appendix A, and Parks et al.'s original article for a full explanation of the dataset simulation methodology.

Before the 140 samples were used, the metadata was modified with new taxon names.
Since the article was published in 2021, the scientific names of individual species have changed, and major updates to phylum level names have occurred. The metadata was therefore updated using the same NCBI taxonomy as was used for building the classifier reference databases.

3.2.2 Fewer Species Dataset - ds-ln-low

The ds-ln-low dataset was created to cover a lower species complexity range, in combination with a higher mean read coverage per species genome than the ds-ln-high. As can be seen in Table 3.3, the dataset contains 40 metagenomic samples of log-normally distributed bacterial communities, divided into 8 groups by number of species and read coverage profile. Each group has 5 replicate samples with a fixed number of species, 10, 20, 40 or 80, with the species average read coverage drawn from a log-normal distribution with the normal µ being either 10X or 20X average genome coverage. The randomness in the sampling process that affected the replicates came from which species were drawn and their specific average read coverages. The sequences used as simulation references were bacterial complete genome level assemblies in the NCBI sequence archive GenBank that were not used for building the classification reference databases. At the time of download (2023-03-17) there were 1076 bacterial species represented among the complete level assemblies; however, only 390 intersected with the species in the classification databases. The pool of genome assembly specimens drawn from to make up a sample thus belonged to those 390 species found both among the GenBank complete level assemblies (not used for building the classification databases) and the genome assemblies in the classification databases. As an example, a sample with 10 species and a coverage µ of 20X (group s10-c20 in Table 3.3) was generated by first sampling, without replacement, 10 species out of the 390 available. The corresponding genome assemblies were then identified for each of the 10 species, and if there was more than one assembly per species a random specimen was selected. Thereafter, 10 values were sampled from a log-normal distribution with µ = 20 and assigned one each to the species as the average read coverage of the genome. The accessions of each sequence in the 10 assemblies (for example chromosome 1, plasmid 1A, plasmid 1B, chromosome 2, plasmid 2A, ..., chromosome 10, plasmid 10A) were paired with their assigned average genome coverage (for example 5.3X, 5.3X, 5.3X, 19.2X, 19.2X, ..., 22X, 22X) and given as input to the InSilicoSeq read simulator. The result was that the plasmids and chromosome of a bacterium had the same average coverage. Furthermore, sampling an average read coverage per species, instead of a number of reads per species, took into consideration that species have differing genome sizes. InSilicoSeq then simulated the sequencing of a physical shotgun metagenomic sample of the 10 species with the listed community fractions. It generated paired 150 basepair long reads covering the genome assemblies the assigned average number of times (5.3X, 19.2X, ..., 22X), while applying sequencing error patterns according to a model trained on the Illumina NovaSeq sequencer. All the generated reads were then finally collected into two FASTQ files, one for each strand. Note that the total number of reads generated per metagenomic sample, and thus also the size of the FASTQ files, was a result of the input parameters (more species and higher coverage resulted in more reads).

Sample group | Replicates | No. species per sample | Coverage distribution µ | Mean no. reads per sample
s10-c10      | 5          | 10                     | 10X                     | 2.2M
s10-c20      | 5          | 10                     | 20X                     | 4.0M
s20-c10      | 5          | 20                     | 10X                     | 4.6M
s20-c20      | 5          | 20                     | 20X                     | 9.0M
s40-c10      | 5          | 40                     | 10X                     | 7.0M
s40-c20      | 5          | 40                     | 20X                     | 15.4M
s80-c10      | 5          | 80                     | 10X                     | 20.1M
s80-c20      | 5          | 80                     | 20X                     | 31.5M

Table 3.3: The ds-ln-low dataset.
Log-normal distributions of community fractions, taking into account that species have different genome sizes. The species pool drawn from is represented in the reference databases, while the specific genome assembly specimens for those species are strictly different from the genome assemblies used for building the reference databases. Paired-end Illumina reads were simulated using InSilicoSeq with the NovaSeq model. The samples have separate files for the forward and reverse strands. Note that the listed read count per sample is for one strand.

3.2.3 Human Spike Dataset - ds-human

The ds-human dataset was generated by using 20 samples from the ds-ln-low dataset, specifically the sample groups s20-c10, s20-c20, s40-c10 and s40-c20, as a bacterial base and then spiking them with human reads. The human reads were added such that the final composition of each sample was 10 % human and 90 % bacteria. The two chromosome level assemblies used for generating human reads were the KOREF_S1v2.1 [49] assembly of a Korean man and the Ash1_v2.2 [50] assembly of an Ashkenazi Jewish man. No value was placed on the specific ethnicity of the assemblies, but rather on that they were from distinct individuals and of differing ethnicity from the GRCh38.p14 assembly, which was used in the classification and alignment databases db-k2-human and db-bwa-human. From each assembly, 10 M Illumina NovaSeq paired-end reads were generated using InSilicoSeq. The required number of human reads was thereafter added to the bacterial samples by random draws from the in total 20 M simulated read pairs (sampling with replacement between each metagenomic sample).

3.2.4 Pathogen Subset Dataset - ds-un-low

A dataset was needed for testing whether the content of the classification databases could be adjusted for better detection of a subset of pathogenic bacteria. The content and structure of the ds-un-low dataset was thus designed in relation to the classification database db-bac-subset.
The classification database was built on genomes from seven bacterial species, A. baumannii, E. faecium, E. faecalis, K. pneumoniae, P. aeruginosa, S. aureus and S. epidermidis, which belong to the five genera Acinetobacter, Enterococcus, Klebsiella, Pseudomonas and Staphylococcus. The approach of the dataset is to have samples with different numbers of species and genera overlapping with the content of the databases. To limit the size of the dataset, only two of the seven species, A. baumannii and S. aureus, were chosen as bases for sample construction. In total, the dataset has 44 samples divided into nine groups with five replicates each (one group has four samples due to an error in simulation). Each sample consists of a different set of 10 species with a fixed uniform average read coverage per genome of 20X. As described in Table 3.4, in sample group a-baumannii-g0 all samples contain A. baumannii, which is in the database db-bac-subset, plus nine random bacterial species which are not in the database and do not belong to a genus that is present in the database. The samples in group a-baumannii-g1 all contain A. baumannii, one other random Acinetobacter species and then eight random bacterial species that are not in the database and whose genera are not in the database. Last in the A. baumannii series is group a-baumannii-g4, which contains samples with A. baumannii, four other random Acinetobacter species plus five other random species that are not in the database and whose genera are not in the database. A similar series arrangement was thereafter made for the samples in groups s-aureus-g0, s-aureus-g1 and s-aureus-g4, with S. aureus as a base. As S. epidermidis also belongs to the genus Staphylococcus, it was excluded from the genus draws for the S. aureus groups. In addition, a negative control series was made. In group control-g0 there is no overlap of species nor genera between the samples and databases.
Groups control-g1 and control-g4 consist of samples with one or four random species from the genera in the databases, but not the specific species, plus nine or six random species of genera not in the database. After the species composition of each sample was determined, corresponding reference genome assemblies were selected. With a list of accessions of the sequences in the assemblies and a fixed average read coverage of 20X, the metagenomic samples were then simulated using InSilicoSeq in the same way as for ds-ln-low. As can be seen in Table 3.4, even though the number of species and the coverage are uniform, the group mean of the number of reads per sample varies between 2.3M and 2.8M because of the different genome lengths of the species in each sample.

Sample group    Replicates  Species in DB  Genus in DB  Species not in DB  Genus not in DB  Mean no. reads
control-g0      5           0              0            10                 10               2.6M
control-g1      5           0              1            10                 9                2.8M
control-g4      4           0              4            10                 6                2.5M
a-baumannii-g0  5           1              1            9                  9                2.7M
a-baumannii-g1  5           1              2            9                  8                2.8M
a-baumannii-g4  5           1              5            9                  5                2.5M
s-aureus-g0     5           1              1            9                  9                2.3M
s-aureus-g1     5           1              2            9                  8                2.4M
s-aureus-g4     5           1              5            9                  5                2.4M

Table 3.4: The ds-un-low dataset contains 44 samples with InSilicoSeq simulated paired-end Illumina reads. One sample in group control-g4 was assigned the wrong content when simulating and was therefore excluded. Five replicates per subgroup and ten species in each sample, with reads drawn for a fixed uniform average genome coverage of 20X. Note that the listed read count per sample is for one strand.

3.2.5 Parameter Optimization Dataset - ds-param-opt

Extra metagenomic samples were created for parameter tuning of the taxonomic classifier. The ds-param-opt dataset is a mixed collection of six InSilicoSeq simulated samples similar to the samples of ds-ln-low and ds-un-low.
The samples have 10-80 species randomly drawn from the same pool as for the ds-ln-low dataset, with either log-normal or fixed uniform abundance distributions. For the composition of each sample, see Table 3.5.

Sample name        No. species  Average genome coverage        No. reads
sample-ln-s10-c20  10           from log-normal dist. µ = 20X  2.9M
sample-ln-s15-c20  15           from log-normal dist. µ = 20X  1.7M
sample-un-s20-c15  20           fixed uniform of 15X           1.7M
sample-ln-s25-c10  25           from log-normal dist. µ = 10X  4.3M
sample-ln-s40-c15  40           from log-normal dist. µ = 15X  11.4M
sample-un-s80-c20  80           fixed uniform of 20X           20.4M

Table 3.5: Composition of the ds-param-opt dataset.

3.3 Classification Metrics

When a metagenomic sample in the digital FASTQ format is passed through the taxonomic classification step, the simplified result is a list of predicted species (predictions are also made at all other taxonomic ranks) and the number of reads classified per species by Kraken 2, followed by the re-estimated number from Bracken. As the samples used in this project were simulated, the true species composition was known, in contrast to real world samples taken from physical microbiomes. In theory, each simulated read fragment could be traced to its origin organism; however, in this project the species metadata was on sample level, with only the total number of reads per species known. As such, classification performance was also measured on sample level and not by whether each individual read fragment was correctly classified. A True Positive (TP) observation is here defined as a species predicted by the Kraken 2 and Bracken combination that was actually in the sample, regardless of whether the predicted and actual abundances (>0) were the same. The question of precision in abundance prediction was not considered in the project scope.
A False Positive (FP) observation is defined as occurring when a species is predicted that was not actually in the sample, and conversely a False Negative (FN) is when a species actually in the sample was not predicted by the taxonomic classifier. The True Negative (TN) observations would theoretically be all the, potentially thousands of, species in the database not actually present in the sample that were also not predicted to be in the sample. It is common to apply a post classification filter to discount predicted species below a certain threshold read count or abundance percentage before calculating the TP, FP and FN [8]. However, the raw output from Bracken was here considered the predicted species if not otherwise specified. Classification metrics can be derived from the sums of the TP, FP and FN observations for a sample. The metrics applied here for measurement of taxonomic classification performance were True Positive Rate (TPR), Positive Predictive Value (PPV), False Discovery Rate (FDR) and F-score (Fβ). The true positive rate, Equation 3.1, also known as recall, is defined as the number of true species found by the classifier divided by the actual number of species in the sample.

TPR = TP / (TP + FN)    (3.1)

Positive Predictive Value, Equation 3.2, also known as precision, is defined as the proportion of species that the classifier correctly predicts being in the sample over the total number of species that are predicted.

PPV = TP / (TP + FP)    (3.2)

False Discovery Rate, Equation 3.3, is the number of species that the classifier falsely predicts being in the sample divided by the total number of predicted species. The false discovery rate is the complement of the positive predictive value.

FDR = FP / (TP + FP) = 1 − PPV    (3.3)

The Fβ score, Equation 3.4, is a weighted mean of the true positive rate and positive predictive value, where β is the factor of importance of the true positive rate in relation to the positive predictive value.
Fβ = ((1 + β²) · TP) / ((1 + β²) · TP + β² · FN + FP)    (3.4)

The Fβ score for β = 1.5 was mainly used here, meaning that the true positive rate was weighted 1.5 times the positive predictive value. However, since the F1 score is the more commonly used metric for comparing taxonomic classifiers, it was reported in addition to F1.5 [8].

3.4 Experimental Runs

In this project, a sample run refers to two FASTQ files with simulated paired-end reads of a shotgun metagenomic sample being processed through the pipeline together with a metadata file on the sample's theoretical content. Because the focus of the investigation was on human DNA filtering and taxonomic classification, the analysis steps surrounding genome assembly and identification of antibiotic resistance genes were switched off to save time. However, minimal pre-processing in the form of read quality trimming and filtering was applied for each run using the program Fastp (v.0.23.2) [51]. The samples in the ds-human dataset were run with both the Kraken 2 (v.2.1.2) and BWA-mem 2 (v.2.2.1) human read filtering methods. The reads in the samples were first classified with Kraken 2 (confidence parameter = 0.2) and the db-k2-human reference database as either human or non-human. The non-human reads were then passed on to taxonomic classification with Kraken 2 (confidence parameter = 0.075) and Bracken (v.2.8) (read threshold = 750 reads) using the db-standard database. Thereafter all the reads, divided into the categories db-k2-human classified, db-standard classified and db-standard unclassified, were aligned to the human genome using BWA-mem 2 and the db-bwa-human database. The total number of reads aligned per sample and Kraken 2 classification category could then be established.
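Because the ground truth is known at sample level, the metrics of Section 3.3 reduce to set operations on species names followed by Equations 3.1-3.4. A minimal sketch; the species labels are hypothetical and the function is an illustration, not part of the thesis pipeline:

```python
def classification_metrics(predicted: set, truth: set, beta: float = 1.5) -> dict:
    """Sample-level TP/FP/FN counts and the derived TPR, PPV, FDR and
    F-beta score, following Equations 3.1-3.4."""
    tp = len(predicted & truth)   # species predicted and truly present
    fp = len(predicted - truth)   # species predicted but absent
    fn = len(truth - predicted)   # species present but missed
    tpr = tp / (tp + fn)          # recall, Eq. 3.1
    ppv = tp / (tp + fp)          # precision, Eq. 3.2
    fdr = 1.0 - ppv               # Eq. 3.3
    b2 = beta ** 2
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)  # Eq. 3.4
    return {"TPR": tpr, "PPV": ppv, "FDR": fdr, "F": f_beta}

truth = {"S. aureus", "K. pneumoniae", "E. faecium", "A. baumannii"}
predicted = {"S. aureus", "K. pneumoniae", "E. faecium", "S. epidermidis"}
m = classification_metrics(predicted, truth)  # TP = 3, FP = 1, FN = 1
```

Note that β > 1 makes each missed species (FN) cost more than each false alarm (FP) in the denominator, which is the motivation for using F1.5 in this screening context.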
The three larger datasets, ds-ln-high, ds-ln-low and ds-un-low, were run through the pipeline for taxonomic classification with Kraken 2 and Bracken using the default parameter settings of Kraken 2 (confidence = 0.0) and Bracken (read threshold = 10 reads)¹. The runs were repeated for each of the five bacterial screening databases db-standard, db-bac-comp, db-bac-comp-np, db-bac-rep and db-bac-rep-np. Additionally, even though the datasets theoretically do not include human genomes, they were all run with a prior Kraken 2 human filtration step (confidence parameter = 0.2). The six samples in the ds-param-opt dataset were run repeatedly for Kraken 2 confidences of 0.0-0.2 and Bracken read thresholds of 10-1250 reads with the db-standard reference database. The datasets ds-ln-high, ds-ln-low and ds-un-low were thereafter re-run for all the screening databases with the combination of Kraken 2 confidence (0.075) and Bracken read threshold (750 reads) that gave the highest F1.5 score for ds-param-opt. Furthermore, the datasets ds-ln-high, ds-ln-low and ds-un-low were run again for the five screening databases but with a more conservative parameter value increase of Kraken 2 confidence = 0.05 and Bracken minimum number of reads = 50, compared to the optimized. Lastly, the ds-un-low pathogen dataset was taxonomically classified with the bacterial subset database db-bac-subset using Kraken 2 with confidence = 0.3 and Bracken with a minimum read threshold of 10,000 reads.

¹The default minimum number of reads classified by Kraken 2 at a taxon node for Bracken to redistribute them is set to 0 in the source code but stated as 10 in the documentation (v.2.8) [45]. The threshold of 10 reads is in this project used as the default.
4 Results and Discussion

4.1 Human DNA Filtering

Assembly of the genomes in the metagenomic sample, that is, reconstructing the genomes of the organisms from the read fragments in the sample, is a step needed before the search for antibiotic resistance genes. However, the process of assembling genomes from a metagenomic sample is expensive in terms of computational time and memory, and it is therefore economical not to assemble reads that have a non-bacterial origin, such as human, as they do not carry the resistance genes. Ideally, all human reads and none of the reads of other origins would be removed; in reality, however, that is hard to achieve. The threshold could either be set to miss some human reads or to incorrectly filter out some bacterial reads. Arguably it would be better to let the assembly take slightly longer by removing most but not all human reads, compared to potentially removing bacterial reads carrying important information. The first approach to filtering out human reads was to do it in association with taxonomic classification using Kraken 2 and the db-standard database, as it contains a human reference in addition to microorganisms. The filtration was unsatisfactory, as a substantial proportion of the human reads were classified as microorganisms, mostly fungi, or were unclassified. Adjusting the Kraken 2 confidence parameter for better human filtration would however at the same time impact the classification of bacteria. As the db-standard database was size restricted when building, it was hypothesized that running against an unrestricted K-mer indexation of the human genome, that is, the db-k2-human database, could catch a larger proportion of the human reads.
Admittedly, building a Kraken 2 database with only one species is contrary to the principle of Kraken 2 classification, which relies on the distinct K-mers brought out by the internal uniqueness of K-mers among the genomes used to build the database. To counterbalance having only one species, and thus low variance, in the database, the Kraken 2 confidence parameter was increased from the default of 0.0 to 0.2, which requires more human K-mer matches per read in order for the read to be classified. In addition, applying a human read filtering step separate from the taxonomic classification enables independent parameter settings. An alternative method to K-mer based classification of human reads is direct alignment of reads to the human genome, and it was tested for comparison purposes. The BWA-mem 2 sequence alignment program was chosen because it is a widely used short-read aligner and is among the top performers. BWA-mem was shown to have a lower runtime per read than one of its closest competitors, Bowtie 2, while having similar alignment performance [52]. Thereafter, BWA-mem 2 was released and shown to be faster than BWA-mem [53]. The theoretical composition of each of the 20 samples in the ds-human dataset is 10 % human origin reads and 90 % bacteria. When the samples in the dataset were run with the Kraken 2 database db-k2-human (confidence = 0.2) and then db-standard (confidence = 0.075) on the non-human filtered reads, on average 9.98 % of the reads were classified as human by db-k2-human, 60.52 % were classified by db-standard and 29.50 % were unclassified by db-standard, see Figure 4.1. When all the reads in the samples were then aligned to the human genome with BWA-mem 2, on average 9.33 % of the reads aligned.
Moreover, if the aligned reads are subdivided into the Kraken 2 classification categories, as seen in Table 4.1, then 92.84 % of the Kraken 2 db-k2-human classified reads aligned to the human genome, 0.03 % of the db-standard classified reads aligned and 0.16 % of the db-standard unclassified. As the human reads were not traced on read level, one cannot draw the immediate conclusion that because 9.98 % of the reads in the samples were classified as human by Kraken 2, they truly were human reads. That on average 92.84 % of the Kraken 2 human classified reads in turn also aligned to the human genome using BWA-mem 2 strengthens the indication that the majority of the reads filtered out are truly human. Another indication is that <1 % of the reads classified by Kraken 2 as not human (either unclassified or classified as a microorganism) aligned to the human genome using BWA-mem 2.

(a) Theoretical composition (b) Kraken 2 outer, BWA-mem 2 inner

Figure 4.1: Average percent of reads filtered out as human in the ds-human dataset, where (a) each sample, independent of size, has a theoretical composition of 10 % simulated human reads and 90 % bacterial reads. The outer doughnut in (b) represents the average proportions of reads that are filtered out by db-k2-human and classified or unclassified by db-standard. The inner doughnut represents the reads that do or do not align to the human genome using BWA-mem 2, in relation to the outer doughnut. It can be seen that the human aligned reads are concentrated among the Kraken 2 human classified reads.
Sample group  % K2 human filter of total  % BWA align to human of total  % BWA align of K2 human filter  % BWA align of K2 classified  % BWA align of K2 unclassified
s20-c10-h10   9.98 [9.97,9.99]            9.68 [9.53,9.78]               96.34 [95.04,97.44]             0.03 [0.02,0.03]              0.18 [0.09,0.38]
s20-c20-h10   9.97 [9.97,9.98]            9.37 [8.91,9.56]               93.21 [88.73,95.28]             0.03 [0.02,0.05]              0.18 [0.13,0.25]
s40-c10-h10   9.97 [9.97,9.98]            9.35 [9.13,9.56]               93.17 [91.10,95.34]             0.03 [0.02,0.03]              0.14 [0.10,0.20]
s40-c20-h10   9.98 [9.97,9.98]            8.91 [8.74,9.13]               88.66 [87.00,90.95]             0.03 [0.02,0.04]              0.14 [0.12,0.16]
Full dataset  9.98 [9.97,9.99]            9.33 [8.74,9.78]               92.84 [87.00,97.44]             0.03 [0.02,0.05]              0.16 [0.09,0.38]

Table 4.1: Average [min, max] percentages of reads classified by Kraken 2 (K2) and BWA-mem 2 (BWA), divided by sample group (5 replicates in each group) of the ds-human dataset. The theoretical composition of the samples is 10 % human and 90 % bacteria.

The percentages of reads filtered with Kraken 2 and BWA-mem 2 are consistent across samples with differing numbers of reads and numbers of other bacterial species, as can be seen across the sample groups in Table 4.1. The result is expected because the classification and alignment are independent for each read, and the presence of more or fewer other bacterial reads should not affect the human reads. The parameter settings for Kraken 2 and BWA-mem 2 were not optimized for human filtration, and it is likely that both approaches could be equally good at filtering if a rigorous parameter optimization were performed. Kraken 2, however, strongly outperformed BWA-mem 2 in speed, as it was on average 94 times faster over the 20 samples run. Can the results be applied to real world samples from the human microbiome? The main limitation is that the human reads in the test dataset were simulated and of a relatively high sequencing quality, which would be expected to vary in real samples.
Furthermore, when simulating the human reads, the coverage was uniformly distributed across the genome, which is unlikely for a real world human genome. Also noteworthy is that the Kraken 2 confidence was set for classification with the db-k2-human database for separating human reads from bacterial reads; hence it is questionable whether the high performance holds for filtering human reads from more similar types of organisms, such as other eukaryotes. In a broader perspective, if a similar pipeline were to be used, for example, in a veterinary setting, then the genomes of the animals in question would be used to build the filtration database. This is relevant since the problem of antibiotic resistance is present in agriculture too. It could possibly be easier to construct a Kraken 2 database with the animal species in question than to develop a physical host DNA extraction kit to apply pre-sequencing.

4.2 Classifier, Parameters and Thresholds

Kraken 2 is known for its speed and high recall for microbial classification, but also for its low precision [1][43]. These features were observed in this project, where false discovery rates higher than 90 % were common for default Kraken 2 and Bracken parameter settings, which meant that a majority of the species identified by the classifier were false positives. What an acceptable false discovery rate is would depend on the specific intended application of the classification results; however, a rate of >90 % is likely unacceptable for most clinical screening purposes. Implementing measures to control the false discovery rate would generally also affect the true positive rate. The goal was therefore to find an approach that reduced the false discovery rate as much as possible without severely reducing the true positive rate. The main classification metric used for harmonizing the two was the F1.5 score.
The F1.5 score, where the true positive rate is weighted higher than the positive predictive value, was chosen over the more common F1 score because it was considered more important not to miss potentially dangerous pathogens than to avoid false alarms. The intention was for the taxonomic classification step to be part of a fast screening pipeline; if bacterial species of interest were found, then further tests with higher precision could possibly be done for increased certainty. There are several approaches to the problem of lowering the false discovery rate. For example, adjustments can be applied to parameters in the Kraken 2 read classification, to which reads are considered for abundance re-estimation by Bracken, and by setting post classification filtering thresholds of minimum estimated community abundance or minimum number of estimated reads assigned to a taxon for it to be considered found. Taking it further, the pre-processing steps before taxonomic classification, such as length and quality trimming, could also play a role, but were not considered within the project scope. Tuning of Kraken 2 and Bracken parameters was first considered.

4.2.1 Kraken 2 and Bracken Parameters

The parameters considered were the read classification confidence for Kraken 2 and the threshold for the minimum number of classified reads required for abundance re-estimation by Bracken. Other Kraken 2 parameters, such as K-mer length or minimizer length, are also relevant for classification performance, as was shown in the Kraken 2 method article [11]. These parameters are however tied to the Kraken 2/Bracken database construction, which in turn can be cumbersome to rebuild for a variety of settings with limited computational resources. The confidence and read threshold were on the other hand adjustable for each sample run without re-building the reference databases.
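The effect of the confidence parameter can be illustrated with a simplified model of Kraken 2's confidence scoring: a read is only kept as classified if the fraction of its K-mers supporting the assigned clade reaches the threshold. This is a sketch of the idea, not Kraken 2's actual implementation (which scores over the taxonomy tree and handles ambiguous K-mers); the read length and k value are illustrative:

```python
def passes_confidence(clade_kmer_hits: int, total_kmers: int,
                      confidence: float) -> bool:
    """Simplified Kraken 2-style confidence check: keep the read as
    classified only if enough of its K-mers support the assigned clade."""
    return clade_kmer_hits / total_kmers >= confidence

# A 150 bp read with k = 35 contains 150 - 35 + 1 = 116 K-mers.
total = 116
# At the default confidence 0.0 a single hit suffices; at 0.2 at least
# 24 of the 116 K-mers must support the clade.
assert passes_confidence(1, total, 0.0)
assert not passes_confidence(23, total, 0.2)   # 23/116 ≈ 0.198 < 0.2
assert passes_confidence(24, total, 0.2)       # 24/116 ≈ 0.207 ≥ 0.2
```

The per-read nature of the check is why the confidence can be tuned independently of the number of species or reads in a sample, in contrast to the Bracken read threshold.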
Ideally, a dataset with a large variety in sample complexity, distribution and species diversity would have been used for finding the optimal parameter settings for a general metagenomic sample or a sub-group of samples. However, due to restrictions in time and sample simulation abilities, the number of samples in the parameter optimization dataset ds-param-opt was held low and with fewer species. It is important to note that the six sample dataset ds-param-opt used for the optimization was not a representation of all possible shotgun metagenomic samples; it contains samples with lower species diversity than, for example, ds-ln-high, and was simulated from a pool of only 390 species. The motivation behind performing an optimization was not to find universally optimal metagenomic parameter settings, but rather to demonstrate the effects of varied parameter settings for the same dataset in relation to the default. The samples in the ds-param-opt dataset were taxonomically classified using the db-standard database for a range of Kraken 2 confidence settings and thresholds for the Bracken minimum number of reads required for read redistribution. In Figure 4.2 the average positive predictive value, true positive rate, F1 and F1.5 scores for the samples in the dataset are shown for Kraken 2 confidences [0.0, 0.2] and Bracken read thresholds [10, 1250]. The results are only in relation to bacteria; possible false positives from other domains were not included. For default settings, the dataset average true positive rate was 100 %, which means that all species in the samples were identified. The positive predictive value was however only 4 %, which means that 96 % of the species predicted by the classifier were false positives. The resulting average F1.5 score was 12 %. Increasing either the confidence or the read threshold increased the positive predictive value and decreased the true positive rate.
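The parameter sweep just described amounts to a small grid search over the two parameters, keeping the setting with the highest dataset-average F1.5. A sketch of that selection step; the scoring function here is a toy placeholder (in the project, each grid point required re-classifying all six samples through the pipeline):

```python
import itertools

confidences = [0.0, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2]
read_thresholds = [10, 50, 250, 750, 1250]

def mean_f15(conf: float, thresh: int) -> float:
    """Placeholder for the dataset-average F1.5 of one pipeline run.
    Toy surface peaking at (0.075, 750), for illustration only."""
    return 1.0 - abs(conf - 0.075) * 4 - abs(thresh - 750) / 2000

# Exhaustively evaluate the grid and keep the best-scoring setting.
best = max(itertools.product(confidences, read_thresholds),
           key=lambda p: mean_f15(*p))
# best == (0.075, 750) for this toy surface
```

Since only 7 × 5 = 35 grid points are involved, exhaustive evaluation is feasible; the expensive part in practice is the repeated classification runs, not the search itself.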
Applying both a high confidence (0.2) and a high read threshold (1250 reads) instead gave an average positive predictive value of 100 %, however at the cost of decreasing the true positive rate to 21 %, resulting in an F1.5 score of 28 %. The F1.5 optimized parameter settings were a Kraken 2 confidence of 0.075 and a Bracken read threshold of 750 reads, which gave an average F1.5 score of 80 %. It was thus clear that the default parameter settings were not optimal for the ds-param-opt dataset. A reason why false positive observations occur, aside from database misannotation, is that a portion of the reads from a species truly in the sample have a higher K-mer match to other closely related species. If a species truly in the sample is in high abundance, then the proportion of reads originating from it that are misclassified can appear significant even if the classification error rate is low on read level. Furthermore, the misclassifications have been shown to be systematic, where finding a certain species in high abundance is correlated to also falsely finding specific species in low abundance [54]. The systematic error was more prevalent in Kraken 2 compared to marker gene type taxonomic classifiers such as MetaPhlAn [54]. That there are systematic errors is however not surprising, since the species have varying levels of phylogenetic relatedness. The problem of systematic misclassifications could then worsen if the abundances of the false positives are artificially inflated by Bracken's redistribution of reads from higher taxonomic levels to lower levels. A threshold for Bracken abundance re-estimation in the form of a minimum fraction of reads classified by Kraken 2 as a specific taxon could be more scalable than using an absolute number of reads.
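Bracken's redistribution can be illustrated in simplified form: reads that Kraken 2 left at a higher rank (e.g. genus) are pushed down to species in proportion to the species-level read counts, after which a minimum-read threshold is applied. The sketch below is only that simplified idea, not Bracken's actual Bayesian re-estimation (which uses database-derived K-mer probabilities); the counts and species are illustrative:

```python
def redistribute(species_reads: dict, genus_reads: int,
                 threshold: int = 750) -> dict:
    """Push genus-level reads down to species proportionally to the
    species-level counts, then apply the minimum-read threshold."""
    total = sum(species_reads.values())
    est = {sp: n + genus_reads * n / total
           for sp, n in species_reads.items()}
    return {sp: round(n) for sp, n in est.items() if n >= threshold}

# 10,000 reads stuck at genus level, three candidate species:
counts = {"S. aureus": 8_000, "S. epidermidis": 1_500, "S. hominis": 500}
est = redistribute(counts, genus_reads=10_000)
# Every species receives extra reads in proportion to its own count,
# so a false positive with a small initial count is inflated further.
```

This makes concrete why redistribution can amplify low-abundance false positives: any species with a nonzero Kraken 2 count receives a proportional share of the higher-rank reads, which can push it past a low threshold.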
If a choice had to be made between adjusting Kraken 2 or Bracken, arguably the Kraken 2 confidence should be increased from default, because its effects are on read level and are independent of the number of species or the number of reads in the sample.

(a) Mean PPV (precision) (b) Mean TPR (recall) (c) Mean F1 (d) Mean F1.5

Figure 4.2: Average species level classification metrics for samples in the dataset ds-param-opt, re-run for Kraken 2 confidence parameter settings between 0.0-0.2 on the x-axis and a Bracken read threshold between 10 and 1250 on the y-axis. The reference database used was db-standard. The highest F1.5 score was observed for a Kraken 2 confidence of 0.075 in combination with a Bracken minimum number read threshold of 750 reads.

The samples in the larger ds-ln-high, ds-ln-low and ds-un-low datasets were taxonomically classified with the Kraken 2 and Bracken combination using different databases for the default parameter settings and then re-run for 'optimized' parameter settings (confidence = 0.075, read threshold = 750). In Table 4.2 the resulting dataset averages of species F1.5, true positive rates and false discovery rates are listed for five databases, while Figure 4.3 graphically shows trends only for the baseline db-standard database.¹ For default Kraken 2/Bracken parameters and the db-standard database, the average F1.5 scores were 27.1 %, 10.6 % and 10.6 % for the ds-ln-high, ds-ln-low and ds-un-low datasets respectively. When instead optimized parameters were used for classification, the average scores increased to 53.1 %, 77.4 % and 82.2 % for the three datasets. That the relative improvement was larger for the ds-ln-low and ds-un-low datasets, compared to ds-ln-high, was not surprising because they were more similar to the dataset used for F1.5 optimization in terms of species diversity, sequencing depth and program used for simulation.
Furthermore, the datasets average false discovery rates decreased from 88.0 %, 96.5 % and 96.5 % for the de- fault parameters to 10.3 %, 24.2 % and 18.3 % for the optimized. In contrast, the largest decrease was observed for the ds-ln-high, however, the true positive rate also decreased the most from 76.4 % to 46.0 %. A probable reason for both lower true positive rates and false discovery rates for the ds-ln-high is that the samples have a higher proportion of species with low genome sequence coverage (less than the Bracken 750 reads threshold) compared to the ds-ln-low and ds-un-low. As the exact same sample datasets were used in each experimental runs, the averages of classification performance metrics can be compared between different databases and parameter settings, however it is important to note that the spread across sam- ples within a dataset can be large, as can be observed in Table 4.2 where approxi- mately 10-20 percentage points differ between the 25 % and 75 % percentiles. Also noteworthy is that the average classification performance of samples in a dataset would not be relevant when for example screening a patient and a confidence inter- val on performance for a certain type of sample would be more applicable. For a future unknown metagenomic sample, with possibly a low sequencing depth, using a Kraken 2 confidence 0.075 and Bracken read threshold of 750 reads which was optimal for the ds-param-opt dataset could be too aggressive. As an example take the the sharp decrease in true positive rate for the ds-ln-high. For the ds-param- opt, the mean F1.5 increased steeply for increased Bracken read thresholds between 10 and 250 reads (not shown in the results) and it was seen that a small change in parameter values could make a difference. It was therefore hypothesized that a smaller parameter adjustment could increase the F1.5 score while being ’safe’ for more diverse types of sample compositions. 
The datasets were re-run again for parameter settings of Kraken 2 confidence 0.05 and Bracken read threshold of 50 reads. The results were that ds-ln-high performed better while ds-ln-low and ds-un-low performed worse compared to classification with the 'optimal' parameter settings. Classification with the conservative parameter adjustment did however perform better for F1.5 than with default settings for all the datasets.

¹Additional F1 scores are listed in Appendix B together with the genus level classification performance for each combination of parameter setting, dataset and database.

(a) False discovery rate (b) True positive rate (c) F1.5 score (d) F1 score

Figure 4.3: Mean [25 percentile, 75 percentile] classification metrics for Kraken 2 confidence and Bracken minimum number of reads threshold parameter settings of default (conf. = 0.00, min reads = 10), optimized (conf. = 0.075, min reads = 750) and light (conf. = 0.05, min reads = 50) across datasets ds-ln-high, ds-ln-low and ds-un-low run with the db-standard reference database. (a) false discovery rates (FDR), (b) true positive rates (TPR), (c) F1.5 scores and (d) F1 scores.

The conclusion is that the default Kraken 2 confidence and Bracken read threshold may be far from optimal. The optimal parameter settings can vary depending on sample type, and for future implementation it could be good to optimize specifically for the range of sequence coverage, species number and diversity relevant for the use case. For example, if the pipeline were to be used for identifying bacteria in skin swabs, samples from a skin microbiome could be simulated and used for parameter optimization. A conservative threshold increase from default to a confidence of 0.05 and a read threshold of 50 reads could be a compromise for unknown samples.

Default (conf. 0.0, min reads 10)

Dataset     Database        F1.5 (%)           TPR (%)              FDR (%)
ds-ln-high  db-standard     27.1 [14.6,40.0]   76.3 [71.5,89.6]     88.0 [81.8,94.8]
            db-bac-comp     25.3 [13.1,37.3]   75.1 [69.7,89.1]     89.0 [83.6,95.5]
            db-bac-comp-np  24.2 [12.6,35.6]   75.1 [69.7,89.1]     89.7 [84.6,95.7]
            db-bac-rep      24.5 [14.2,37.6]   91.5 [88.5,97.8]     90.0 [83.7,95.1]
            db-bac-rep-np   24.0 [13.9,36.8]   91.5 [88.5,97.9]     90.3 [84.1,95.2]
ds-ln-low   db-standard     10.6 [8.7,11.4]    98.7 [98.4,100.0]    96.5 [96.2,97.2]
            db-bac-comp     10.5 [8.6,11.4]    99.5 [100.0,100.0]   96.5 [96.2,97.2]
            db-bac-comp-np  9.5 [7.8,10.2]     99.5 [100.0,100.0]   96.9 [96.6,97.5]
            db-bac-rep      5.0 [4.0,5.7]      93.4 [90.0,97.8]     98.4 [98.2,98.7]
            db-bac-rep-np   4.8 [3.9,5.6]      93.4 [90.0,97.8]     98.4 [98.2,98.8]
ds-un-low   db-standard     10.6 [8.0,11.8]    99.5 [100.0,100.0]   96.5 [96.1,97.4]
            db-bac-comp     10.3 [8.2,11.7]    100.0 [100.0,100.0]  96.6 [96.1,97.3]
            db-bac-comp-np  9.3 [7.1,10.9]     100.0 [100.0,100.0]  96.9 [96.4,97.7]
            db-bac-rep      5.1 [3.9,5.9]      95.2 [90.0,100.0]    98.4 [98.1,98.8]
            db-bac-rep-np   4.9 [3.8,5.6]      95.2 [90.0,100.0]    98.4 [98.2,98.8]

Optimal (conf. 0.075, min reads 750)

ds-ln-high  db-standard     53.1