Taxonomic Classification of Bacteria in Shotgun Metagenomic Samples
Evaluation of Species Level Classification in the Context of Pathogen Screening for Healthcare Applications

Master's thesis in Engineering Mathematics and Computational Science

IRIS GOLD RODAL

Department of Mathematical Sciences
Division of Mathematical Statistics
Chalmers University of Technology
SE-412 96 Gothenburg, Sweden
Telephone +46 31 772 1000
www.chalmers.se

Gothenburg, Sweden 2023

© IRIS GOLD RODAL, 2023.

Supervisor: Oscar Aspelin, Bioinformatician at 1928 Diagnostics
Examiner: Erik Kristiansson, Professor in Biostatistics and Bioinformatics at Chalmers University of Technology

Master's Thesis 2023

Abstract

The aim was to investigate taxonomic classification and removal of human host DNA in the context of a bioinformatic analysis pipeline for screening of pathogens. The examination was carried out using simulated short-read sequenced shotgun metagenomic samples. It was found that a majority of the DNA of human origin could be separated from bacteria using the k-mer based read classifier Kraken 2 with a custom-built human-only reference database.
Effects on taxonomic classification performance were surveyed for variations in sample composition, parameter settings of the taxonomic classifier and reference database composition. Maintaining both high precision and high recall for species-level taxonomic classification of metagenomic samples was challenging under limited computational resources. A one-size-fits-all approach to taxonomic classification of any shotgun metagenomic sample would be near impossible with the tested k-mer based classifiers (Kraken 2 and Bracken); instead, specialized pipeline tracks optimized for different expected ranges of species, sequencing depths and abundance distributions could be a solution.

Keywords: metagenome taxonomic classification, shotgun metagenomics, WGS, Kraken 2, Bracken, taxonomic classifier, host removal.

Acknowledgements

This endeavor would not have been possible without the collaboration with 1928 Diagnostics, who provided me with the opportunity to carry out my thesis project at the company, and I am thankful for the resources they provided. First and foremost I would like to express my gratitude to my supervisor Oscar Aspelin for sharing expert bioinformatics knowledge and for his continued guidance and encouragement throughout the year. I am further thankful for the support from the development team at 1928 Diagnostics, who, in addition to helpful coding tips and tricks, gave me much appreciated insight into how problems similar to the one in this thesis are tackled in real-world applications. I would also like to direct many thanks to my examiner Erik Kristiansson for taking on this project and to my opponent Jewahri Idris Yousuf, especially for thoughtful feedback and suggestions on the report.

Iris Gold Rodal, Gothenburg, October 2023

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Aim and Scope
    1.1.1 Taxonomic Classification
    1.1.2 Human DNA Filtering

2 Theory
  2.1 Metagenomic Sequencing
  2.2 Plasmids
  2.3 Genomic Reference Sequences and Taxonomy
  2.4 Kraken 2 - Taxonomic Classifier
  2.5 Bracken - Abundance Reestimation

3 Methods
  3.1 Databases
    3.1.1 Human Reference
    3.1.2 Taxonomic Classification - General Screening
    3.1.3 Taxonomic Classification - Pathogen Subset Screening
  3.2 Simulated Metagenomic Datasets
    3.2.1 Many Species Dataset - ds-ln-high
    3.2.2 Fewer Species Dataset - ds-ln-low
    3.2.3 Human Spike Dataset - ds-human
    3.2.4 Pathogen Subset Dataset - ds-un-low
    3.2.5 Parameter Optimization Dataset - ds-param-opt
  3.3 Classification Metrics
  3.4 Experimental Runs

4 Results and Discussion
  4.1 Human DNA Filtering
  4.2 Classifier, Parameters and Thresholds
    4.2.1 Kraken 2 and Bracken Parameters
    4.2.2 Post Classification Filtering
  4.3 Database Content
    4.3.1 Bacteria Only Databases
    4.3.2 Plasmid Depleted Databases
    4.3.3 Database for Pathogen Subset
  4.4 Metagenomic Sample Composition
    4.4.1 The Fewer Species Dataset ds-ln-low
    4.4.2 The Pathogen Subset Dataset ds-un-low
    4.4.3 The Many Species Dataset ds-ln-high
    4.4.4 Reference Genome Considerations

5 Conclusion

Bibliography

A Article Dataset
B Additional Results

List of Figures

1.1 Main segments of the in silico pipeline for taxonomic classification and identification of antimicrobial resistance of shotgun metagenomic samples from the raw sequence files (FASTQ format). The samples are pre-processed with quality control of the reads in the sequence files; reads of human origin are thereafter removed in a filtering step before the sample is passed to taxonomic classification, where the species in the sample are identified. The non-human reads in the sample are assembled, which is where the mixed sequence fragments are sorted and reassembled into the original coherent genome sequences for each organism. Using the genome assemblies, genes associated with antimicrobial resistance can be found. The focus of this thesis is the steps in red, which are filtering of human host DNA and taxonomic classification.

2.1 One of several possible metagenomic processes where DNA is extracted from a biological sample and converted to a digital format by a sequencer. The sequenced DNA data can then be processed to extract information such as species of origin or genes.

2.2 Sketches of bacteria with plasmids. (a) The plasmid is the smaller circular DNA element separate from the bacterial chromosome. (b) Bacteria may have zero, one or multiple plasmids of different types, varying both intra- and inter-species.
(c) Plasmid conjugation between two bacteria, where the plasmid in Bacterium 1 is transcribed and then passed through a pilus, a temporary connecting 'tube', to Bacterium 2.

2.3 (a) Nodes in a taxonomic tree where the vertical levels represent ranks, together with an example of a ranked lineage for the bacterial species Staphylococcus aureus, rooted at the rank of domain and branching out to the rank of strain. The taxonomic classification is limited to ranks of species and above. (b) The taxonomic classifier Kraken 2 assigns sequence reads to a node in the taxonomic tree. A read assigned to a lower rank will by default also be assigned to all the directly ascending nodes in the tree; however, the opposite is not true, as reads can get 'stranded' at higher taxonomic ranks if the classification does not reach the Kraken 2 confidence threshold for lower ranks. Note that the figure is a simplified sketch and would in reality contain all the ranks and nodes in the Kraken 2 reference databases used for the taxonomic classification. (c) Bracken is applied on sample level after read assignment by a Kraken-type classifier, and it redistributes stranded reads from higher to lower taxonomic nodes. Note that the threshold for the minimum number of prerequisite reads a node must have for Bracken to assign extra reads is here set at ten, and as such no stranded reads at the Staphylococcaceae node are redistributed to the Macrococcus node. (d) A redistribution of reads by Bracken results in a re-estimation of the sample abundance percentages, independent across each taxonomic rank, which are displayed in parentheses.

3.1 Pipeline scaffold for human host filtering, taxonomic classification and antibiotic resistance gene identification.
4.1 Average percent of reads filtered out as human in the iss-human dataset, where (a) each sample, independent of size, has a theoretical composition of 10 % simulated human reads and 90 % bacterial reads. The outer doughnut in (b) represents the average proportions of reads that are filtered out by k2-human and classified or unclassified by k2-b-standard. The inner doughnut represents the reads that do or do not align to the human genome using BWA-mem 2, in relation to the outer doughnut. It can be seen that the human-aligned reads are concentrated among the Kraken 2 human-classified reads.

4.2 Average species-level classification metrics for samples in the dataset ds-param-opt, re-run for Kraken 2 confidence parameter settings between 0.0 and 0.2 on the x-axis and a Bracken read threshold between 10 and 1250 on the y-axis. The reference database used was db-standard. The highest F1.5 score was observed for a Kraken 2 confidence of 0.075 in combination with a Bracken minimum read threshold of 750 reads.

4.3 Mean [25th percentile, 75th percentile] classification metrics for Kraken 2 confidence and Bracken minimum-number-of-reads threshold parameter settings of default (conf. = 0.00, min reads = 50), optimized (conf. = 0.075, min reads = 750) and light (conf. = 0.05, min reads = 10) across datasets ds-ln-high, ds-ln-low and ds-un-low, run with the db-standard reference database. (a) False discovery rates (FDR), (b) true positive rates (TPR), (c) F1.5 scores and (d) F1 scores.

4.4 Average (a) false discovery rates (FDR), (b) true positive rates (TPR) and (c) F1.5 scores for the datasets ds-ln-high, ds-ln-low and ds-un-low over read classification abundance filtering for each species. Note that the x-axes are logarithmic.
4.5 Mean [25th percentile, 75th percentile] (a) false discovery rates (FDR), (b) true positive rates (TPR) and (c) F1.5 scores for the databases db-standard, db-bac-comp, db-bac-comp-np, db-bac-rep and db-bac-rep-np across datasets ds-ln-high, ds-ln-low and ds-un-low, when classified with default Kraken 2 and Bracken.

B.1 Average (a) false discovery rates (FDR) Kraken 2 read cut-off, (b) false discovery rates (FDR) Bracken read cut-off, (c) true positive rates (TPR) Kraken 2 read cut-off, (d) true positive rates (TPR) Bracken read cut-off, (e) F1.5 scores Kraken 2 read cut-off and (f) F1.5 scores Bracken read cut-off, for the datasets ds-ln-high, ds-ln-low and ds-un-low over read classification abundance filtering for each species. The classification reference database used for (a)-(f) is the baseline db-standard. Note that the x-axes are logarithmic.

List of Tables

3.1 Databases for human DNA filtering, taxonomic classification of general bacteria and taxonomic classification of a subset of pathogenic bacteria. In the database names, Kraken 2 is abbreviated k2 and BWA-mem 2 bwa, indicating for which program each database was constructed; the remaining databases are combined Kraken 2 and Bracken references.

3.2 Overview of simulated datasets used.

3.3 The iss-ln-rand dataset. Log-normal distribution of community fraction, taking into account that species have different genome sizes. The species pool drawn from is represented in the reference databases, while the specific genome assembly specimens for those species are strictly different from the genome assemblies used for building the reference databases. Simulated paired-end Illumina reads using InSilicoSeq with the NovaSeq model. The samples have separate files for forward and reverse strands.
Note that the listed read count per sample is for one strand.

3.4 The iss-path-subset dataset contains 44 samples with InSilicoSeq-simulated paired-end Illumina reads. One sample in group control-g4 was assigned the wrong content when simulating and was therefore excluded. Five replicates per subgroup, ten species in each sample, with reads drawn for a uniform average genome coverage of 20X for all species in the samples. Note that the listed read count per sample is for one strand.

3.5 Composition of the ds-param-opt dataset.

4.1 For each sample group (5 replicates in each group) of the ds-human dataset, the average [min, max] percentages of reads classified by Kraken 2 (K2) and BWA-mem 2 (BWA). The theoretical composition of the samples is 10 % human and 90 % bacteria.

4.2 Species-level taxonomic classification metrics F1.5 score, true positive rate (TPR) and false discovery rate (FDR) (dataset sample mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

4.3 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the dataset ds-ln-low, run for the db-standard database with Kraken 2 confidence = 0.075 and Bracken read threshold = 750. The values listed are means of the five replicates in each dataset group together with the minimum and maximum [min, max] of the sample group.

4.4 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the dataset ds-un-low, run for the db-standard database with Kraken 2 confidence = 0.075 and Bracken read threshold = 750.
The values listed are means of the five replicates in each dataset group together with the minimum and maximum [min, max] of the sample group. Note that the group names are in reference to the pathogen subset databases, and all bacteria in db-standard were in this case considered for prediction.

A.1 Dataset from the article Evaluation of the Microba Community Profiler for Taxonomic Profiling of Metagenomic Datasets From the Human Gut Microbiome by Parks et al. [1], adapted into the ds-ln-high dataset. Table recreated from Table 2 in the article.

B.1 Species-level taxonomic classification metrics F1, true positive rate (TPR) and false discovery rate (FDR) (dataset mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

B.2 Genus-level taxonomic classification metrics F1.5 and F1 (dataset mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

B.3 Genus-level taxonomic classification metrics true positive rate (TPR) and false discovery rate (FDR) (dataset mean [25th percentile, 75th percentile]) for different combinations of datasets and general bacterial screening databases, repeated for three Kraken 2 and Bracken parameter settings (leftmost column).

B.4 Species-level taxonomic classification of the ds-un-low dataset where only the subset of 7 species were considered. The average [min, max] numbers of true positives (TP), false positives (FP) and false negatives (FN) for the five replicates in each dataset group are listed together with the theoretically expected numbers of positives (P) and negatives (N).
The dataset was run for the db-standard database with Kraken 2 confidence = 0.075 and Bracken read threshold = 750, and for the db-bac-subset database with Kraken 2 confidence = 0.3 and Bracken read threshold = 10,000.

B.5 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the bacteria in the dataset ds-ln-high, run for the db-standard database with Kraken 2 confidence = 0.05 and Bracken read threshold = 50. The values listed are means of the 10 replicates in each dataset group together with the minimum and maximum [min, max] of the sample group.

B.6 Species-level taxonomic classification metrics F1, F1.5, true positive rate (TPR) and false discovery rate (FDR) for the bacteria in the dataset ds-ln-high, run for the db-bac-rep database with Kraken 2 confidence = 0.05 and Bracken read threshold = 50. The values listed are means of the 10 replicates in each dataset group together with the minimum and maximum [min, max] of the sample group.

1 Introduction

Resistance to antibiotics among pathogenic bacteria is listed by the World Health Organization as one of the largest threats to global health [2]. When antibiotics stop working, previously easily treatable infections can become life-threatening to individuals and a dire problem for healthcare systems. Infections can afflict anyone; however, people who are immunocompromised, such as premature babies and organ transplant recipients, are particularly at risk and in need of quick diagnostic methods [3][4]. An important part of tracking and preventing infectious disease outbreaks in hospitals is identification of both the bacterial species and the specific genetic components that can cause antibiotic resistance [5].
There are several different types of methods for identification of species, such as those based on direct laboratory tests or DNA sequencing. An emerging DNA sequencing based technology is shotgun metagenomic sequencing [6]. It is generally faster than single-isolate sequencing, where a time-consuming cell cultivation step is needed, and it captures more information (such as resistance information) than the amplicon metagenomic sequencing method does. In shotgun metagenomics, all the DNA in the sample is sequenced directly, without prior cultivation and separation of species [7]. The drawback of the method, however, is that the bioinformatic analysis can become complicated by a mixture of sequences from different species at different abundances, as well as by contaminants such as DNA from human hosts.

1.1 Aim and Scope

The context of this work is the need for fast and comprehensive methods for screening for pathogens and associated antimicrobial resistance (AMR) in healthcare settings. A broader goal is to develop an automated bioinformatic pipeline for analysis of metagenomic samples, for example taken from patients' bodily fluids, skin swabs or hospital facilities. The main purpose of the pipeline is to take raw whole-genome shotgun metagenomic sequencing data as input, perform analyses and output a report with identified organisms and markers for AMR. The analysis steps in a minimal proposed prototype can be divided into pre-processing of the sample data, removal of human host DNA, taxonomic classification, genome assembly and identification of antimicrobial resistance markers, shown schematically in Figure 1.1. The scope of the project is, however, limited to implementing and evaluating two parts of the prototype shotgun metagenomics pipeline: the taxonomic classification and the filtering of human DNA. A more in-depth description of the aims and scope pertaining to the selected parts is given in Sections 1.1.1 and 1.1.2 below.
Bioinformatic analyses can be inaccessible and expensive, especially for large whole-genome sequenced metagenomic samples, as methods tend to be computationally intensive and require high memory usage [8]. To make the findings more applicable, the challenge is set for the pipeline to be adapted for running on a laptop (500 GB disc, 16 GB RAM and 8 CPUs). Furthermore, to achieve the functionalities, building blocks of open software programs and algorithms are used; there is thus no intention of constructing a fully original pipeline.

[Figure 1.1 schematic: Seq. file → Pre-process → Human filter → Tax. class. → Genome assembly → AMR ID → Result report]

Figure 1.1: Main segments of the in silico pipeline for taxonomic classification and identification of antimicrobial resistance of shotgun metagenomic samples from the raw sequence files (FASTQ format). The samples are pre-processed with quality control of the reads in the sequence files; reads of human origin are thereafter removed in a filtering step before the sample is passed to taxonomic classification, where the species in the sample are identified. The non-human reads in the sample are assembled, which is where the mixed sequence fragments are sorted and reassembled into the original coherent genome sequences for each organism. Using the genome assemblies, genes associated with antimicrobial resistance can be found. The focus of this thesis is the steps in red, which are filtering of human host DNA and taxonomic classification.

1.1.1 Taxonomic Classification

Knowing which microorganisms, especially pathogens, are present in a sample and which are not can be central to treatment of infections or other medical interventions [9]. The function of a taxonomic classification step is to identify and name the organisms in the metagenomic sample based on unique properties of their genome sequences.
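How well a classifier performs this task is quantified later in the thesis with metrics such as the true positive rate (TPR), false discovery rate (FDR) and F-beta scores like F1.5. As a sketch of how such species-level metrics follow from the sets of predicted and true species (function and species names here are illustrative only, not from the thesis code):

```python
def classification_metrics(predicted, truth, beta=1.5):
    """Species-level TPR, FDR and F-beta from predicted vs. true species sets."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # species correctly reported
    fp = len(predicted - truth)   # false discoveries
    fn = len(truth - predicted)   # species missed
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall
    fdr = fp / (tp + fp) if tp + fp else 0.0   # 1 - precision
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    fbeta = (1 + b2) * tp / denom if denom else 0.0
    return tpr, fdr, fbeta

# Two of three true species found, plus one false discovery.
tpr, fdr, f15 = classification_metrics(
    {"S. aureus", "E. coli", "K. pneumoniae"},
    {"S. aureus", "E. coli", "P. aeruginosa"},
)
```

With beta = 1.5 the F-beta score weights recall somewhat higher than precision, matching the screening setting where a missed pathogen is costlier than a false discovery.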
An advantage of shotgun metagenomics is the possibility of capturing genome sequences of pathogens from all domains of life at the same time; however, due to project limitations, the investigation into the problem of taxonomic classification is constrained to bacteria. Furthermore, the lowest taxonomic rank the bacteria will be differentiated at is species, and the classification is limited to previously characterized species. Ideal taxonomic classification, within the scope of the project, is thus defined as correctly classifying all non-novel bacteria in a shotgun metagenomic sample down to the species rank with zero false discoveries.

There are multiple factors involved in determining the difficulty level and success of taxonomic classification. These include, but are not limited to, sample content, taxonomic classifier and reference database. The aim is to describe the individual and combined effects of selected variables from each category and to optimize for classification performance.

The main aspects considered regarding sample composition are how the classification is affected by the number of species in the sample, the similarity between species and the sequencing depth of the organisms in the sample. The scope is limited to short-read samples sequenced using Illumina technologies, which is one of the commonly used sequencing methods for whole-genome shotgun metagenomics [10].

The examination of taxonomic classifiers is focused on the k-mer based DNA-to-DNA classification program Kraken 2 [11] and its companion program Bracken [12]. Kraken 2 has consistently been among the top performers in regards to computational time and classification recall, however lower on precision [8][13][1]. Bracken is a tool for Bayesian re-estimation of abundance after classification with Kraken and has been shown to enhance the abundance estimates of the taxonomic classifications [12].
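The core idea behind this k-mer based classification — look up each k-mer of a read in a taxon-indexed database and only accept a call supported by a sufficient fraction of the read's k-mers (the confidence threshold) — can be illustrated with a deliberately simplified sketch. Real Kraken 2 uses minimizers, a compact probabilistic hash table and lowest-common-ancestor resolution on the full taxonomic tree, none of which is modelled here:

```python
def build_kmer_index(references, k=5):
    """Map every k-mer of each reference sequence to its taxon label.
    K-mers shared by several taxa are marked ambiguous (None), a crude
    stand-in for Kraken 2's lowest-common-ancestor resolution."""
    index = {}
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in index and index[kmer] != taxon:
                index[kmer] = None  # ambiguous between taxa
            else:
                index[kmer] = taxon
    return index

def classify_read(read, index, k=5, confidence=0.1):
    """Assign the read to the taxon with the most k-mer hits, but only
    if that taxon supports at least `confidence` of the read's k-mers."""
    counts = {}
    total = len(read) - k + 1
    for i in range(total):
        taxon = index.get(read[i:i + k])
        if taxon is not None:
            counts[taxon] = counts.get(taxon, 0) + 1
    if not counts:
        return None  # unclassified
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= confidence else None

# Toy references; real databases index whole genomes.
idx = build_kmer_index({"taxA": "ACGTACGTACGT", "taxB": "TTTTTGGGGGCCCCC"})
```

Raising the `confidence` parameter trades recall for precision, which is exactly the behaviour explored for Kraken 2's confidence setting in Chapter 4.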
The objective is to investigate if parameter tuning of Kraken 2 and the accompanying Bracken can improve classification performance.

Kraken 2 uses an indexed database of genetic reference sequences and, for fast classification, needs enough free memory to hold the database in memory, or maps it from disc as a slower alternative [11]. As memory space is limited, there will for a fixed file size be a trade-off between the amount of representation each species has and the number of species in the database. The aim is to survey if custom-built reference databases can increase classification performance compared to a pre-built standard, within the size constraints of being able to run on a laptop.

1.1.2 Human DNA Filtering

Collateral human DNA is expected to be included when doing shotgun metagenomic sampling directly from a patient, for example of blood or skin and oral swabs [14]. Mixed human-origin DNA is also likely to appear in varying amounts when sampling from an environment such as hospitals [15]. In the context of a bioinformatic pipeline intended for clinical application, it is therefore likely for incoming samples to contain substantial amounts of human DNA if it is not removed in the laboratory. The content of the human DNA is not of interest for taxonomic classification of bacteria or identification of antibiotic resistance; however, storing and processing it would consume unnecessary resources as well as have implications for privacy, as it can be linked to individuals.

The bottleneck for computational resources is the genome assembly step. Genome assembly, where the mixed sequence fragments in the sample are sorted and reassembled into the original coherent genome sequences for each organism, is needed for the identification of antibiotic resistance genes. Depending on the program used for assembly, the time and resource usage can grow fast when the sequence complexity grows, such as when the number of species and the genome sizes increase [16][17].
Filtering out human reads from the sample before assembly could reduce the complexity, as the human genome is large compared to those of microorganisms. It is therefore relevant to investigate methods for filtering of human reads as a pipeline step when working with limited memory and computational resources. Before sequencing the metagenomic sample, physical human DNA fragments can be removed or degraded using commercial human host depletion kits [14][18]. However, the processes of human DNA elimination can have biased effects on the recovery of microbial DNA for certain species due to their characteristics [14]. Filtering of human DNA reads in silico, post sequencing, can be used as a complement or a preferred alternative to the depletion kits, depending on the percentage of human DNA in the sample and the intended analysis [14]. There are established workflows for separating human reads from the microbial reads in the sample FASTQ file based on direct alignment of reads to a human reference genome [19]. The k-mer based taxonomic read classifier Kraken 2 has been shown to be comparably fast at classifying microorganisms in metagenomic samples [11]. It is proposed here to take advantage of the speed of Kraken 2 for the extrapolated task of human read classification and removal.

The aim pertaining to a pipeline step for filtering out human DNA is to investigate if Kraken 2 can be a faster alternative to alignment based methods while remaining competitive in quality. Specifically, the questions to answer are: Can human-origin reads be filtered out from an Illumina short-read shotgun metagenomic sample using the k-mer based taxonomic classifier Kraken 2? Is filtering by Kraken 2 comparable to a method using the Burrows-Wheeler aligner BWA-mem 2?

2 Theory

The intent of the chapter is to provide a theoretical background on the methodology behind metagenomic sequencing data, reference databases and taxonomic classification programs.
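The read-level human filtering proposed in Section 1.1.2 amounts to partitioning a FASTQ file on the classifier's per-read verdicts. As a sketch, assuming Kraken 2's tab-separated per-read output (a C/U classified flag, read ID and taxonomy ID in the first three columns) and the NCBI taxonomy ID 9606 for Homo sapiens; in a real pipeline Kraken 2 can instead write the non-human reads directly via its --unclassified-out option when run against a human-only database:

```python
HUMAN_TAXID = "9606"  # NCBI taxonomy ID for Homo sapiens

def human_read_ids(kraken_output_lines):
    """Collect IDs of reads Kraken 2 classified as human.
    Assumes the standard per-read output: <C|U> <read id> <taxid> ..."""
    ids = set()
    for line in kraken_output_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "C" and fields[2] == HUMAN_TAXID:
            ids.add(fields[1])
    return ids

def filter_fastq(fastq_lines, drop_ids):
    """Yield the 4-line FASTQ records whose read ID is not in drop_ids."""
    for i in range(0, len(fastq_lines) - 3, 4):
        record = fastq_lines[i:i + 4]
        read_id = record[0][1:].split()[0]  # strip '@', drop description
        if read_id not in drop_ids:
            yield from record

# Hypothetical classifier output: read1 human, read2 some other taxon,
# read3 unclassified. Only read1 should be filtered out.
kraken_out = ["C\tread1\t9606\t150\t...", "C\tread2\t1280\t150\t...",
              "U\tread3\t0\t150\t..."]
fastq = ["@read1 1:N:0", "ACGT", "+", "IIII",
         "@read2 1:N:0", "CCGG", "+", "IIII",
         "@read3 1:N:0", "TTAA", "+", "IIII"]
kept = list(filter_fastq(fastq, human_read_ids(kraken_out)))
```

For paired-end samples the same ID set would be applied to both the forward and reverse FASTQ files so that mates stay synchronized.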
2.1 Metagenomic Sequencing

Metagenomic sampling is used when characterizing microbiomes in both the natural and human environment [20][21]. Metagenomics is also emerging in the context of infectious disease detection for species identification and antimicrobial resistance prediction [9][6]. A metagenomic sample is a biological sample that contains genetic material from the organisms, potentially multiple of different types, present in the sampled environment. When a metagenomic sample is collected, by for example a skin swab or tissue sample, the genetic material, in this case DNA, is first extracted from the cells [22]. To be able to interpret the information carried by the physical DNA strands, they need to be converted into a readable form by a DNA sequencer. The extracted DNA strands can be sequenced either by targeting certain regions using amplicon methodology, or non-selectively in so-called shotgun sequencing, where everything in the sample is sequenced [23].

The output from DNA sequenced with next-generation sequencing technologies is a digital read file, typically in the FASTQ format, where each read of a DNA fragment is a transcription into letters accompanied by a Phred quality score for each DNA nucleotide [24]. The reads can be of differing length depending on the type of sequencer. For example, Illumina short-read sequencers give read lengths of 50-300 bp, and read lengths of 150 bp are recommended by Illumina for whole-genome sequencing [25][26]. Illumina sequencing can generate so-called paired-end reads, where a fragment sequenced in both directions gives rise to two paired reads, and a sequenced sample can then have two FASTQ files, one for the reads of the forward strand and one for the reverse strand [24]. The average sequencing error rates for Illumina sequencers are less than 1 %; however, the errors are non-uniformly distributed across the reads and furthermore depend on the specific Illumina platform used [27].
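A FASTQ record consists of four lines: an @-prefixed identifier, the base calls, a '+' separator and a quality string whose characters encode the Phred scores (Q = ASCII value − 33 in the offset-33 convention used by current Illumina output, with error probability p = 10^(−Q/10)). A minimal decoder, as a sketch:

```python
def parse_fastq_record(lines):
    """Decode one 4-line FASTQ record into (read id, sequence, Phred scores)."""
    header, seq, _plus, qual = lines
    read_id = header[1:].split()[0]        # drop '@' and any description
    scores = [ord(c) - 33 for c in qual]   # Phred+33 encoding
    return read_id, seq, scores

def error_probability(q):
    """Phred score Q -> probability that the base call is wrong."""
    return 10 ** (-q / 10)

# 'I' encodes Q40 (1 error in 10,000), '!' encodes Q0, '5' encodes Q20.
rid, seq, scores = parse_fastq_record(["@read1 desc", "ACGT", "+", "II!5"])
```

Quality-control tools in the pre-processing step of Figure 1.1 apply essentially this decoding when trimming low-quality read ends.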
Information can be gained directly from the reads, such as in some types of taxonomic classification, or the reads can be assembled into the original genomes using the property that reads overlap along the genome [23]. Because the organisms in the metagenomic sample are not separated before extraction and sequencing of the DNA, the reads from all organisms in the sample are mixed together in the FASTQ file, in contrast to sequencing of single-isolate cultures where all reads originate from one type of organism [22]. The relative abundance of cells from a certain species in the physical sample will approximately carry over to the relative abundance of reads from that species in the FASTQ file; however, the read abundance is also influenced by the length of the species' genome. The average read coverage of a species' genome is the average number of reads covering a nucleotide position in the genome. A metagenomic sample with a non-uniform distribution of species abundances will result in the genomes of some species being covered by fewer reads than others, even below an average of 1X depending on the sequencing depth of the sample [28]. To increase the read coverage of the lowest-abundance species, the whole sample must be sequenced further, which extends the sequencing time.

Figure 2.1: One of several possible metagenomic processes where DNA is extracted from a biological sample and converted to a digital format by a sequencer. The sequenced DNA data can then be processed to extract information such as species of origin or genes.

2.2 Plasmids

A majority of a bacterium's genes are located on its chromosome; however, a subset of accessory genes may be found on a plasmid, which is extrachromosomal circular DNA [29], see Figure 2.2 (a). The functionalities of plasmids are diverse, as are their sizes, which range from hundreds of base pairs to over a million [30].
The prevalence and type of plasmids vary between bacterial species but can also differ within a species, as illustrated in Figure 2.2 (b). Similar to the chromosome, plasmids are replicated and passed on to daughter cells in association with cell division; however, they also carry genes for independent transmission and replication [29]. Furthermore, plasmids are mobile genetic elements that can be horizontally transferred, by so-called conjugation, between prokaryotic cells in the same generation, see Figure 2.2 (c). Conjugation of plasmids can occur naturally both between bacterial cells of the same species and between cells of two different species [31]. Plasmids are therefore a common vector for the spread of genes associated with antibiotic resistance properties [29]. Plasmid transfer is however not random, as it may be induced by environmental factors and is furthermore unequally distributed across bacterial species [32]. For example, the F-type plasmids are mainly transferred among species within the Enterobacteriaceae family, which includes pathogenic strains of Escherichia and Salmonella [33].

Figure 2.2: Sketches of bacteria with plasmids. (a) The plasmid is the smaller circular DNA element separate from the bacterial chromosome. (b) Bacteria may have zero, one or multiple plasmids of different types, varying both intra- and inter-species. (c) Plasmid conjugation between two bacteria, where the plasmid in Bacterium 1 is transcribed and then passed through a pilus, a temporary connecting 'tube', to Bacterium 2.

2.3 Genomic Reference Sequences and Taxonomy

The National Center for Biotechnology Information (NCBI) is maintained by the United States government and provides public access to biomedical and genomic information [34]. The NCBI online resources take the form of reference databases, sequence analysis tools and repositories for research studies.
NCBI genomic resources include genomes, mainly of human-associated microorganisms, organized in databases with nucleotide sequences, genome assemblies and mapped annotations. GenBank is an archival database of publicly available genetic sequences. It accepts submissions of whole genome sequences and annotated genome assemblies, in addition to, for example, transcriptome assemblies and targeted locus studies [35]. Until the year 2022, 1,349,781 genome assemblies of cellular organisms were submitted, of which 91 % were from bacteria [36]. RefSeq is another sequence collection distributed by NCBI and can be described as a non-redundant and curated version of GenBank [37][38]. In contrast to GenBank, NCBI owns the records and can update, combine and improve the annotations and sequences [38]. The two databases have similar content but differ in format and in how the sequences are annotated. RefSeq contains genome assembly entries from all domains of life. For entries until and including the year 2022 there are 331,892 assemblies of cellular organisms, of which 41 % are bacterial [36]. Furthermore, of the bacterial assemblies 33,364 are on complete genome level, and when atypical genomes are excluded there are 32,645 assemblies remaining. More studied organisms are overrepresented in the number of assembly entries in RefSeq, and the taxonomic resolution is often higher, with entries categorized on lower taxonomic levels such as sub-species and/or strain level. For example, a subset of seven pathogenic bacteria more closely examined in this thesis, A. baumannii, E. faecium, E. faecalis, K. pneumoniae, P. aeruginosa, S. aureus and S. epidermidis, make up 13.6 % of the complete level bacterial assemblies [36]. Every entry of a genome assembly in RefSeq is given a new unique RefSeq assembly accession, alongside an individual accession for each FASTA sequence in the assembly.
The assembly is tagged with its estimated level of completeness, where the alternatives in increasing order are contig, scaffold, chromosome and complete genome level. For a complete assembly of a bacterial organism, the number of sequences in the FASTA file is expected to equal the combined number of chromosomes and plasmids in the organism in question. The number and type of plasmids may not be consistent between exemplars of the same species, as there can be a physiological variation in plasmid distribution. A subset of the RefSeq assemblies are categorized by NCBI as Representative genomes, also referred to as Reference genomes, based on their sequence annotation quality, recognition as a community standard or medical importance [38][36]. So as not to confuse terminology: any genome can be used as a reference, for example when aligning reads or building an indexed database; however, the RefSeq category refers to assemblies of particular importance, generally one per species. In exceptional cases there may be more than one reference genome per prokaryote species; for example, E. coli has two. The number of bacterial reference genomes listed until 2022 is 16,765, spread across all assembly completeness levels. Another aspect of the RefSeq database entries is their status. A genome assembly may be assigned a suppressed status for reasons that include being replaced by another record to reduce redundancy, or removal by curators due to lack of support [37]. Suppressed records are however still retrievable, but carry a marker.

The taxonomy system used by NCBI is a hierarchical organization of organism relations with the main ranks of superkingdom/domain, phylum, class, order, family, genus and species [39], sketched in Figure 2.3 (a). All NCBI records, such as genome assemblies, are mapped to nodes in the taxonomic tree.
For cellular organisms, the distance between two organisms across ranks in the taxonomic tree is not always consistent with the actual genetic similarity between the organisms, and the taxonomic tree is thus not a phylogenetic tree [40]. To capture the variability, extra ranks may be added, such as species complexes, sub-species and strains. With new research discoveries the taxonomy is continuously updated by adding or merging taxonomic nodes [39]. Furthermore, changes of nomenclature conventions may also trigger updates of the NCBI taxonomy. For example, an inclusion of the phylum rank in the International Code of Nomenclature of Prokaryotes led to the renaming or new inclusion of 41 phyla in NCBI, affecting millions of records in January of 2023 [41]. For evaluation of taxonomic classification it is thus important to correctly align the versions of the taxonomy and the current scientific names used by the classifier and by the metadata of the metagenomic sample.

2.4 Kraken 2 - Taxonomic Classifier

Kraken 2 is a faster and more memory efficient re-written version of the taxonomic classification program Kraken [11]. In Kraken 2, each read in a FASTQ sequence file of a metagenomic sample is classified independently (for short-read sequencing a read is typically 150 nucleotides long). Simplified, it is based on querying read sub-sequences of length K against a specially built Kraken 2 database containing genomic reference sequences associated with nodes in a taxonomic tree, see Figure 2.3 (b). Some K-mers will uniquely map to nodes at the lowest taxonomic levels, while others, by evolutionary relatedness or random chance, are shared by several lower taxonomic nodes and are only unique for a higher taxonomic node, here called a lowest common ancestor.
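The lowest-common-ancestor idea can be sketched with a toy database. The two-species lineage table below is hypothetical, and real Kraken 2 uses minimizer hashing and weighted root-to-leaf scoring rather than the simple majority vote shown here:

```python
from collections import Counter

# Toy lineages, root -> leaf (illustrative labels, not real NCBI records).
LINEAGE = {
    "S. aureus":      ["Bacteria", "Staphylococcus", "S. aureus"],
    "S. epidermidis": ["Bacteria", "Staphylococcus", "S. epidermidis"],
}

def lca(taxa):
    """Lowest common ancestor: the deepest rank shared by all lineages."""
    paths = [LINEAGE[t] for t in taxa]
    ancestor = None
    for level in zip(*paths):
        if len(set(level)) == 1:
            ancestor = level[0]
        else:
            break
    return ancestor

def build_db(genomes, k=4):
    """Map each K-mer to the LCA of every genome containing it."""
    owners = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(taxon)
    return {kmer: lca(taxa) for kmer, taxa in owners.items()}

def classify(read, db, k=4):
    """Assign the read to the taxon hit by the most of its K-mers."""
    hits = Counter(db[read[i:i + k]] for i in range(len(read) - k + 1)
                   if read[i:i + k] in db)
    return hits.most_common(1)[0][0] if hits else "unclassified"

genomes = {"S. aureus": "ACGTACGTTT", "S. epidermidis": "ACGTAAAAGG"}
db = build_db(genomes)
```

In this toy database the K-mer 'ACGT', present in both genomes, maps to the shared genus node, while genome-unique K-mers map to the species nodes.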
For example, if the database was built with only one reference genome, all the K-mers of that genome sequence would be uniquely mapped to its taxonomic node; if another genome sequence was added from a different node, fewer K-mers would uniquely map to the first taxonomic node, and for an infinite number of reference sequences no unique K-mers would be found. In reality, the Kraken 2 database is a probabilistic compact hash table where only minimizers of the K-mers, sub-strings of length L ≤ K, are stored; even so, a database built from a large reference collection can reach sizes on the order of 1 TB [43]. When Kraken 2 is run, by default the entire Kraken 2 database is read into random access memory (RAM), which can be problematic if the reference database is large. If the option --memory-mapping is used, the database is instead read from disk; however, the runtime will increase as a result. For control of the database size, a limit on the final file size can be applied during the build step using the argument --max-db-size [11]. The same reference sequences will be represented as for the full size, but with fewer than the possible unique K-mers per sequence. In general, limiting the database size (for the same genome sequences) reduces the classification precision [43]. An alternative to limiting the database size in the build step is to start with fewer genomic reference sequences, possibly by representing less variation within, for example, each species, or by including genomes from fewer species in total.

(a) Taxonomic tree (b) Kraken 2 read assignment (c) Bracken reestimation 1 (d) Bracken reestimation 2

Figure 2.3: (a) Nodes in a taxonomic tree, where the vertical level represents ranks, together with an example of a ranked lineage for the bacterial species Staphylococcus aureus, rooted at the rank of domain and branching out to the rank of strain. The taxonomic classification is limited to ranks of species and above. (b) The taxonomic classifier Kraken 2 assigns sequence reads to a node in the taxonomic tree.
A read assigned to a lower rank will by default also be assigned to all the directly ascending nodes in the tree; however, the opposite is not true, as reads can get 'stranded' at higher taxonomic ranks if the classification does not reach the Kraken 2 confidence threshold for lower ranks. Note that the figure is a simplified sketch and would in reality contain all the ranks and nodes in the Kraken 2 reference databases used for the taxonomic classification. (c) Bracken is applied on sample level after read assignment by a Kraken type classifier, and it redistributes stranded reads from higher to lower taxonomic nodes. Note that the threshold for the minimum number of reads a node must already have for Bracken to assign extra reads is here set at ten, and as such no stranded reads at the Staphylococcaceae node are redistributed to Macrococcus. (d) A redistribution of reads by Bracken results in a reestimation of the sample abundance percentages, independent across each taxonomic rank, which are displayed in parentheses.

2.5 Bracken - Abundance Reestimation

Bracken is a program for Bayesian reestimation of abundances of metagenomic samples after classification with Kraken, Kraken 2 or KrakenUniq [12][45]. Bracken has been shown to improve the overall classification performance, in addition to the abundance estimation, when used in combination with Kraken type classifiers compared to use of the classifier alone [8][43]. The tool is lightweight, with runtimes of less than a minute [13]. However, to run Bracken a reference database is required, and it needs to be custom built for the specific Kraken, Kraken 2 or KrakenUniq database used for the classification. Because the Kraken type classifiers perform read level classification based on the lowest common ancestor, a proportion of the reads can get stranded at nodes for higher taxonomic ranks, resulting in few reads at lower levels [12].
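The redistribution of stranded reads illustrated in Figure 2.3 (c) can be sketched as a proportional push-down. This is a simplification: Bracken's actual reestimation is Bayesian and weighted by K-mer similarity between genomes, and the function name and counts below are illustrative only:

```python
def redistribute(parent_reads, species_counts, threshold=10):
    """Proportionally push reads stranded at a parent node down to species
    that already have at least `threshold` reads assigned by the classifier.
    Simplified stand-in for Bracken's Bayesian redistribution."""
    eligible = {s: n for s, n in species_counts.items() if n >= threshold}
    total = sum(eligible.values())
    out = dict(species_counts)
    if total == 0:
        return out, parent_reads  # nothing eligible: reads stay stranded
    for s, n in eligible.items():
        out[s] = n + parent_reads * n / total
    return out, 0

# 100 reads stranded at the genus/family node; Macrococcus is below the
# ten-read threshold (as in Figure 2.3 (c)) and receives nothing.
counts, stranded = redistribute(
    100, {"S. aureus": 60, "S. epidermidis": 20, "Macrococcus": 5})
```

With these toy numbers, S. aureus receives 100 × 60/80 = 75 extra reads and S. epidermidis the remaining 25, after which per-rank abundance percentages can be recomputed as in Figure 2.3 (d).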
The problem is accentuated for species in the database with high average nucleotide identity, and can thus result in uncertainty in the prediction of a sample's species level composition. Bracken solves the problem of low classification rates at lower taxonomic ranks by probabilistic redistribution of 'stranded' reads at nodes higher up in the taxonomic tree down towards the species rank (or any other requested taxonomic rank) [12], see Figure 2.3 (c) and (d). For a species node to receive additional reads, it needs to directly descend from the higher taxonomic node and have a minimum number of reads already assigned by the Kraken type classifier [45]. The threshold for the minimum number of reads can be adjusted using the --threshold parameter.

3 Methods

Various reference databases, metagenomic sample datasets and software programs were used for the investigation of human DNA filtering and taxonomic classification. In this chapter, first the content of the databases and datasets is described, together with the methods used to build them. Thereafter, performance metrics for taxonomic classification are defined, and lastly the experimental runs are described in terms of the combinations of databases, datasets and classifiers used. For orientation of where the components fit in the pipeline, see the flowchart in Figure 3.1, where the main steps of a short-read analysis are displayed. A sample enters the pipeline in the form of a .FASTQ file, or two in the case of paired-end reads. The sample is first pre-processed by trimming low quality ends of reads and length filtering. Human origin reads can be removed from the sample by either Kraken 2 K-mer matching or genome alignment with BWA-mem2, using human reference databases. The remaining human-free sample is passed on to taxonomic classification and antibiotic resistance gene identification, of which only the former is further analyzed here.
The non-human reads are taxonomically classified with Kraken 2 using a bacterial reference database, and the output is passed to Bracken for abundance re-estimation using a Kraken 2 complementary reference database. The results are collected and compared to possible sample metadata containing a theoretical sample species composition and known resistance genes. The pipeline output is a report per sample, in addition to dataset summaries if samples are run in batch.

[Flowchart: Start → sample .FASTQ files → pre-processing (quality control, read trimming) → human filtering by either Kraken 2 classification (Kraken 2 GRCh38.p14 database) or BWA-mem 2 alignment (BWA GRCh38.p14 database) → non-human reads → taxonomic classification with Kraken 2 (Kraken 2 bacteria database) → abundance re-estimation with Bracken (Bracken bacteria database) → taxonomic IDs and abundance estimates → result collection and processing with sample metadata → report → End. Parallel track: genome assembly with Megahit → contigs → gene alignment with BLAST against a resistance gene database → resistance genes and locations.]

Figure 3.1: Pipeline scaffold for human host filtering, taxonomic classification and antibiotic resistance gene identification.

3.1 Databases

A collection of reference databases was built for assessing the impact of the database choice on human DNA filtering and the taxonomic classification of bacteria. An overview of the databases and their properties can be found in Table 3.1. Two databases, the Kraken 2 db-k2-human and the BWA-mem 2 db-bwa-human, were used for evaluating human origin DNA filtering. For general bacterial screening, a prebuilt Kraken 2 and Bracken combined baseline database, db-standard, was compared against two pairs of bacteria-only combined Kraken 2 and Bracken databases, where each pair contains one database with plasmid sequences kept and one with plasmid sequences removed: db-bac-comp and db-bac-comp-np, and db-bac-rep and db-bac-rep-np.
In addition, the Kraken 2/Bracken db-bac-subset database, containing only a subset of seven pathogenic bacteria, was built for screening specifically for a selection of bacteria.

Database name    | Content                          | Plasmids? | Genomes | Size cap? | Size (GB)
db-k2-human      | Human                            | -         | 1       | no        | 4.6
db-bwa-human     | Human                            | -         | 1       | no        | 20.1
db-standard [44] | Archaea, bacteria, human, virus  | yes       | -       | yes       | 8.0
db-bac-comp      | Bacteria                         | yes       | 33k     | yes       | 8.6
db-bac-comp-np   | Bacteria                         | no        | 33k     | yes       | 8.6
db-bac-rep       | Bacteria                         | yes       | 17k     | yes       | 8.0
db-bac-rep-np    | Bacteria                         | no        | 17k     | yes       | 8.0
db-bac-subset    | Bacteria                         | yes       | 5k      | no        | 0.4

Table 3.1: Databases for human DNA filtering, taxonomic classification of general bacteria and taxonomic classification of a subset of pathogenic bacteria. In the database names, Kraken 2 is abbreviated k2 and BWA-mem 2 bwa, indicating for which programs those databases were constructed, while the other databases are combined Kraken 2 and Bracken references.

3.1.1 Human Reference

Both K-mer and alignment based methods, used for detection of human reads in a sample, require an indexed database including human reference sequences. The same human genome assembly, GRCh38.p14 [46] from the Genome Reference Consortium, was used as the reference when building the K-mer based Kraken 2 database db-k2-human and the alignment based BWA-mem 2 database db-bwa-human.

The db-k2-human was built from the assembly with default database building parameter settings: a K-mer length of 35 nucleotides, a minimizer length of 31 nucleotides and 7 minimizer spaces. Contrary to default settings, no masking of the human genome regions containing low complexity sequences, such as regions with repeated 'ACACACAC' or 'AAAAAAA', was performed before building the indexes, by using the argument --no-masking. Furthermore, no memory capping was applied to the build, and the resulting database had a size of 4.6 GB.
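The kind of low-complexity regions that masking would otherwise remove can be flagged with a simple base-diversity heuristic. This is illustrative only: real pipelines use dedicated masking tools, and the window size and distinct-base threshold below are arbitrary choices:

```python
def low_complexity(seq, window=16, max_distinct=2):
    """Return one flag per window: True when the window is dominated by very
    few distinct bases, as in 'ACACACAC...' or 'AAAAAAA...' repeats.
    Toy heuristic, not a reimplementation of any real masking tool."""
    return [len(set(seq[i:i + window])) <= max_distinct
            for i in range(len(seq) - window + 1)]
```

A dinucleotide repeat such as 'ACACACACACACACACAC' is flagged in every window, while a mixed 18-base sequence is not; skipping this masking (--no-masking) keeps such repeats in the index, where they can attract spurious K-mer matches.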
For the db-bwa-human database, each chromosome, and the collected mitochondrial DNA, in the human genome assembly was indexed separately by BWA-mem 2 before being combined. The combined database size was 20.1 GB, although as a result of the chromosome split no individual file exceeded 1 GB, and each could therefore be loaded entirely into memory when running on a laptop with 16 GB RAM.

3.1.2 Taxonomic Classification - General Screening

The baseline database was fetched from the Index zone collection of prebuilt Kraken 2 and Bracken combined databases and is here named db-standard (original name Standard-8, version 2022-12-09) [44]. The collection, which is regularly updated, contains databases of various content and sizes, and is maintained by Ben Langmead, who is associated with the Kraken projects [44]. According to Index zone, the Standard-8 database was built using all complete level assemblies for bacteria, archaea, plasmid and virus listed on RefSeq until 2022-12-09. The human genome (GRCh38) was also included in the Standard-8, and its Kraken 2 library was created with the --no-masking argument, which means that low complexity regions of the genome were not masked. The NCBI UniVec_Core library, containing vector, adapter, linker and primer sequences, was also included in the database. The Standard-8 is the 8 GB size-capped version of the Standard 64 GB database and was constructed using the exact same genome assemblies, however with less of each included, which was accomplished with the --max-db-size argument. Lastly, the Kraken 2 complementary Bracken database was added to the Standard-8 database. Apart from the non-masking of low complexity regions of the human genome and the size capping, the Kraken 2 and Bracken databases were built with default settings.

The combined Kraken 2 and Bracken index database db-bac-comp was built using only sequences from the bacterial domain.
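As described later in this section, the plasmid-free '-np' database variants were produced by dropping FASTA records whose header lines mention 'plasmid'. A minimal sketch of such header-based filtering (a hypothetical helper, not the project's actual script, operating on a list of lines rather than streamed files):

```python
def strip_plasmids(fasta_lines):
    """Drop every FASTA record whose '>' header line contains 'plasmid'.
    All sequence lines up to the next header share their header's fate."""
    out, keep = [], True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = "plasmid" not in line.lower()
        if keep:
            out.append(line)
    return out

# Toy assembly with one chromosome record and one plasmid record
# (hypothetical accessions for illustration).
fasta = [">NZ_X chromosome", "ACGT", ">NZ_Y plasmid pA", "GGCC"]
filtered = strip_plasmids(fasta)
```

Note that this kind of filtering relies entirely on the header annotation; unannotated plasmid sequences would pass through.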
The sequences downloaded were all RefSeq bacterial genome assemblies with an assembly level of complete, recorded before 2022-12-31, and with an exclusion of genomes tagged as atypical. Assemblies with a RefSeq status of suppressed were automatically excluded by Kraken 2 during the build step, which amounted to 5.3 % of the total number of downloaded sequences. Default parameter settings were used in the Kraken 2 database build, apart from applying the argument --max-db-size. The size of the hash table was intended to be limited to 8 GB, the same size as the db-standard; however, the wrong gigabyte definition was applied, which resulted in a size of 8.6 GB. A Bracken database was thereafter built as an attachment to the Kraken 2 database using default settings.

A combined Kraken 2 and Bracken bacterial database without plasmids, db-bac-comp-np, was constructed using the same bacterial genome assemblies and parameter settings as db-bac-comp, with the exception that the plasmid sequences were excluded from the sequence library before the database build step. The plasmids were removed by searching for and deleting sequences with the word 'plasmid' in the sequence header for each assembly FASTA file.

The databases db-bac-rep and db-bac-rep-np were built using the same methods as the databases db-bac-comp and db-bac-comp-np respectively, except with a different set of genome sequences. The sequences used were bacterial assemblies in the RefSeq category of Representative/Reference genomes listed up until 2022-12-31. The assembly level of completeness ranged from contig to complete. No records were suppressed out of the 923,148 individual sequences downloaded; however, 8 had unidentified sequence accessions and were therefore excluded from the Kraken 2 database build.

3.1.3 Taxonomic Classification - Pathogen Subset Screening

A reference database, db-bac-subset, for Kraken 2 with a supplementary Bracken database, was built for a subset of 7 bacterial pathogens.
The sequences used consisted of all RefSeq complete level assemblies of the bacteria A. baumannii, E. faecium, E. faecalis, K. pneumoniae, P. aeruginosa, S. aureus and S. epidermidis listed until 2022-12-31. Of the downloaded sequences, 10.2 % were automatically excluded from use in the databases as a result of having an NCBI suppressed status. The db-bac-subset database was built using default settings, including the default of no size capping. Without size limitations, the resulting database was 0.4 GB.

3.2 Simulated Metagenomic Datasets

Read simulation was used to create a set of metagenomic samples with known composition, for which taxonomic classification could be evaluated. Indeed, the exact sample composition, in terms of species and their proportions, was in itself required in order to obtain high resolution classification performance metrics. The datasets of raw read shotgun metagenomic samples used in this project were all generated by simulation from already taxonomically profiled reference genomes. One dataset was adapted from a 2021 article on the evaluation of taxonomic classifiers (Parks et al. [1]) and four were simulated anew in the project using the Illumina short-read simulation program InSilicoSeq (v.1.5.4) [47]. The samples in the InSilicoSeq datasets (ds-ln-low, ds-human, ds-un-low and ds-param-opt) were simulated using genome sequences distinct from those used for building the reference databases; however, there was overlap in source sequences between the databases and the article-adapted dataset (ds-ln-high). An overview of the five datasets, which in total contain 250 samples, can be found in Table 3.2.

Dataset name   | Number of samples | Sequence in databases? | Bacterial species per sample
ds-ln-high [1] | 140               | yes                    | 50-600
ds-ln-low      | 40                | no                     | 10-80
ds-human       | 20                | no                     | 20-40
ds-un-low      | 44                | no                     | 10
ds-param-opt   | 6                 | no                     | 10-80

Table 3.2: Overview of the simulated datasets used.
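The read counts quoted for the datasets below follow from coverage arithmetic: the mean coverage of a genome is the total number of sequenced bases divided by the genome length. A small sketch, reusing figures that appear elsewhere in this thesis (150 bp reads, a 5 Mbp average bacterial genome); the helper names are illustrative:

```python
def mean_coverage(n_reads, read_len, genome_len):
    """Average read coverage: total sequenced bases / genome length."""
    return n_reads * read_len / genome_len

def reads_for_coverage(coverage, genome_len, read_len=150):
    """Invert the relation: reads needed to reach a target mean coverage."""
    return round(coverage * genome_len / read_len)

# 40 reads of 150 bp on a 5 Mbp genome is far below 1X coverage,
# while 20X coverage of the same genome needs roughly 667k reads.
low = mean_coverage(40, 150, 5_000_000)
deep = reads_for_coverage(20, 5_000_000)
```

This is why, at a fixed sequencing depth, low-abundance species in species-rich samples can end up with only a handful of reads.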
3.2.1 Many Species Dataset - ds-ln-high

The ds-ln-high dataset consists of 140 simulated Illumina short-read (150 bp) samples of mock prokaryotic communities, adapted from Parks et al. [1]. The reads were simulated using in part the same RefSeq genome assemblies as were used for the reference database build (db-standard). However, as the average nucleotide identity (ANI) to the reference genomes was modified in the simulation, the authors estimated that using in part the same original genome assemblies for databases and sample construction would not significantly bias the classification results. The dataset was divided into 14 groups of 10 samples with replicate compositions, meaning that the replicates were drawn from the same distributions for factors such as species diversity, number of species and average nucleotide identity to the closest reference genome. The samples included mainly bacteria but also archaea, with the number of species per sample drawn from a normal distribution with either µ = 100, σ = 25 or µ = 500, σ = 100, log-normal species abundance distributions and varying diversity in strains per species. The number of nucleotides simulated per sample (2.1 × 10^9 bp/sample) was independent of the number of species in the sample. Since the samples had a fixed number of reads and the within-sample species abundances were log-normally distributed, some species in samples with a high number of species had relative abundances as low as 1.9 × 10^-6 %. For context, a bacterial species, which on average has a genome size of 5 million bp [48], was on the extreme low side represented in a sample by only 40 reads of 150 bp, which means a coverage ≪ 1X. For a display of all sample group variations, see Table A.1 in Appendix A, and Parks et al.'s original article for a full explanation of the dataset simulation methodology.

Before the 140 samples were used, the metadata was modified with new taxon names.
Since the article was published in 2021, the scientific names of individual species have changed, and major updates to phylum level names have occurred. The metadata was therefore updated using the same NCBI taxonomy as was used for building the classifier reference databases.

3.2.2 Fewer Species Dataset - ds-ln-low

The ds-ln-low dataset was created to cover a lower species complexity range, in combination with a higher mean read coverage per species genome than the ds-ln-high. As can be seen in Table 3.3, the dataset contains 40 metagenomic samples of log-normally distributed bacterial communities, divided into 8 groups by number of species and read coverage profile. Each group has 5 replicate samples with a fixed number of species, 10, 20, 40 or 80, with the species average read coverage drawn from a log-normal distribution with the normal µ being either 10X or 20X average genome coverage. The randomness in the sampling process that affected the replicates came from which species were drawn and their specific average read coverages. The sequences used as simulation references were bacterial complete genome level assemblies in the NCBI sequence archive GenBank that were not used for building the classification reference databases. At the time of download (2023-03-17) there were 1076 bacterial species represented among the complete level assemblies; however, only 390 intersected with the species in the classification databases. The pool of genome assembly specimens drawn from to make up a sample thus belonged to those 390 species found both among the GenBank complete level assemblies (not used for building the classification databases) and the genome assemblies in the classification databases. As an example, a sample with 10 species and a coverage µ of 20X (group s10-c20 in Table 3.3) was generated by first sampling, without replacement, 10 species out of the 390 available. The corresponding genome assemblies were then identified for each of the 10 species, and if there was more than one assembly per species a random specimen was selected. Thereafter, 10 values were sampled from a log-normal distribution with µ = 20 and assigned one each to the species as the average read coverage of the genome. The accessions of each sequence in the 10 assemblies (for example chromosome 1, plasmid 1A, plasmid 1B, chromosome 2, plasmid 2A, ..., chromosome 10, plasmid 10A) were paired with their assigned average genome coverage (for example 5.3X, 5.3X, 5.3X, 19.2X, 19.2X, ..., 22X, 22X) and given as input to the InSilicoSeq read simulator. The result was that the plasmids and chromosome of a bacterium had the same average coverage. Furthermore, sampling an average read coverage per species, instead of a number of reads per species, took into consideration that species have differing genome sizes. InSilicoSeq then simulated the sequencing of a physical shotgun metagenomic sample of the 10 species with the listed community fractions. It generated paired 150 basepair long reads covering the genome assemblies the assigned average number of times (5.3X, 19.2X, ..., 22X), while applying sequencing error patterns according to a model trained on the Illumina NovaSeq sequencer. All the generated reads were then finally collected into two FASTQ files, one for each strand. Note that the total number of reads generated per metagenomic sample, and thus also the size of the FASTQ files, was a result of the input parameters (more species and higher coverage resulted in more reads).

Sample group | Replicates | No. species per sample | Coverage distribution µ | Mean no. reads per sample
s10-c10      | 5          | 10                     | 10X                     | 2.2M
s10-c20      | 5          | 10                     | 20X                     | 4.0M
s20-c10      | 5          | 20                     | 10X                     | 4.6M
s20-c20      | 5          | 20                     | 20X                     | 9.0M
s40-c10      | 5          | 40                     | 10X                     | 7.0M
s40-c20      | 5          | 40                     | 20X                     | 15.4M
s80-c10      | 5          | 80                     | 10X                     | 20.1M
s80-c20      | 5          | 80                     | 20X                     | 31.5M

Table 3.3: The ds-ln-low dataset.
Log-normal distributions of community fractions, taking into account that species have different genome sizes. The species pool drawn from is represented in the reference databases, while the specific genome assembly specimens for those species are strictly different from the genome assemblies used for building the reference databases. Paired-end Illumina reads were simulated using InSilicoSeq with the NovaSeq model. The samples have separate files for the forward and reverse strands. Note that the listed read count per sample is for one strand.

3.2.3 Human Spike Dataset - ds-human

The ds-human dataset was generated by using 20 samples from the ds-ln-low dataset, specifically the sample groups s20-c10, s20-c20, s40-c10 and s40-c20, as a bacterial base and then spiking them with human reads. The human reads were added such that the final composition of each sample was 10 % human and 90 % bacteria. The two chromosome level assemblies used for generating human reads were the KOREF_S1v2.1 [49] assembly of a Korean man and the Ash1_v2.2 [50] assembly of an Ashkenazi Jewish man. No value was placed on the specific ethnicity of the assemblies, but rather on that they were from distinct individuals and of differing ethnicity from the GRCh38.p14 assembly, which was used in the classification and alignment databases db-k2-human and db-bwa-human. From each assembly, 10 M Illumina NovaSeq paired-end reads were generated using InSilicoSeq. The required number of human reads was thereafter added to the bacterial samples by random draws from the in total 20 M simulated read pairs (sampling with replacement between each metagenomic sample).

3.2.4 Pathogen Subset Dataset - ds-un-low

A dataset was needed for testing whether the content of the classification databases could be adjusted for better detection of a subset of pathogenic bacteria. The content and structure of the ds-un-low dataset was thus designed in relation to the classification database db-bac-subset.
The classification database was built on genomes from seven bacterial species, A. baumannii, E. faecium, E. faecalis, K. pneumoniae, P. aeruginosa, S. aureus and S. epidermidis, which belong to the five genera Acinetobacter, Enterococcus, Klebsiella, Pseudomonas and Staphylococcus. The approach of the dataset is to have samples with different numbers of species and genera overlapping with the content of the databases. To limit the size of the dataset, only two of the seven species, A. baumannii and S. aureus, were chosen as bases for sample construction. In total, the dataset has 44 samples divided into nine groups with five replicates each (one group has four samples due to an error in simulation). Each sample consists of a different set of 10 species with a fixed uniform average read coverage per genome of 20X. As described in Table 3.4, in sample group a-baumannii-g0 all samples contain A. baumannii, which is in the database db-bac-subset, plus nine random bacterial species which are not in the database and do not belong to a genus that is present in the database. The samples in group a-baumannii-g1 all contain A. baumannii, one other random Acinetobacter species and then eight random bacterial species that are not in the database and whose genera are not in the database. Last in the A. baumannii series is group a-baumannii-g4, which contains samples with A. baumannii, four other random Acinetobacter species plus five other random species that are not in the database and whose genera are not in the database. A similar series arrangement was thereafter made for the samples in groups s-aureus-g0, s-aureus-g1 and s-aureus-g4, with S. aureus as a base. As S. epidermidis also belongs to the genus Staphylococcus, it was excluded from the genus draws for the S. aureus groups. In addition, a negative control series was made. In group control-g0 there is no overlap of species nor genera between the samples and databases.
Groups control-g1 and control-g4 consist of samples with one or four random species from the genera in the databases, but not the specific species, plus nine or six random species of genera not in the database. After the species composition of each sample was determined, corresponding reference genome assemblies were selected. With a list of accessions of the sequences in the assemblies and a fixed average read coverage of 20X, the metagenomic samples were then simulated using InSilicoSeq in the same way as for ds-ln-low. As can be seen in Table 3.4, even though the number of species and the coverage are uniform, the group mean of the number of reads per sample varies between 2.3M and 2.8M because of the different genome lengths of the species in each sample.

Sample group    Replicates  Species in DB  Genus in DB  Species not in DB  Genus not in DB  Mean no. reads
control-g0      5           0              0            10                 10               2.6M
control-g1      5           0              1            10                 9                2.8M
control-g4      4           0              4            10                 6                2.5M
a-baumannii-g0  5           1              1            9                  9                2.7M
a-baumannii-g1  5           1              2            9                  8                2.8M
a-baumannii-g4  5           1              5            9                  5                2.5M
s-aureus-g0     5           1              1            9                  9                2.3M
s-aureus-g1     5           1              2            9                  8                2.4M
s-aureus-g4     5           1              5            9                  5                2.4M

Table 3.4: The ds-un-low dataset contains 44 samples with InSilicoSeq simulated paired-end Illumina reads. One sample in group control-g4 was assigned the wrong content when simulating and was therefore excluded. Five replicates per subgroup and ten species in each sample, with reads drawn for a fixed uniform average genome coverage of 20X. Note that the listed read count per sample is for one strand.

3.2.5 Parameter Optimization Dataset - ds-param-opt

Extra metagenomic samples were created for parameter tuning of the taxonomic classifier. The ds-param-opt dataset is a mixed collection of six InSilicoSeq simulated samples similar to the samples of ds-ln-low and ds-un-low.
The samples have 10-80 species randomly drawn from the same pool as for the ds-ln-low dataset, with either log-normal or fixed uniform abundance distributions. For the composition of each sample, see Table 3.5.

Sample name        No. species  Average genome coverage        No. reads
sample-ln-s10-c20  10           from log-normal dist. µ = 20X  2.9M
sample-ln-s15-c20  15           from log-normal dist. µ = 20X  1.7M
sample-un-s20-c15  20           fixed uniform of 15X           1.7M
sample-ln-s25-c10  25           from log-normal dist. µ = 10X  4.3M
sample-ln-s40-c15  40           from log-normal dist. µ = 15X  11.4M
sample-un-s80-c20  80           fixed uniform of 20X           20.4M

Table 3.5: Composition of the ds-param-opt dataset.

3.3 Classification Metrics

When a metagenomic sample in the digital FASTQ format is passed through the taxonomic classification step, the simplified result is a list of predicted species (predictions are also made at all other taxonomic ranks) and the number of reads classified per species by Kraken 2, followed by the re-estimated number from Bracken. As the samples used in this project were simulated, the true species composition was known, in contrast to real world samples taken from physical microbiomes. In theory, each simulated read fragment could be traced to its origin organism; however, in this project the species metadata was on sample level, with only the total number of reads per species known. As such, classification performance was also measured on sample level and not by whether each individual read fragment was correctly classified. A True Positive (TP) observation is here defined as a species predicted by the Kraken 2 and Bracken combination that was actually in the sample, regardless of whether the predicted and actual abundances (>0) were the same. The question of precision in abundance prediction was not considered in the project scope.
A False Positive (FP) observation is defined as occurring when a species is predicted that was not actually in the sample, and conversely a False Negative (FN) is when a species actually in the sample was not predicted by the taxonomic classifier. The True Negative (TN) observations would theoretically be all the, potentially thousands of, species in the database not actually present in the sample that were also not predicted to be in the sample. It is common to apply a post classification filter to discount predicted species below a certain threshold read count or abundance percentage before calculating the TP, FP and FN [8]. However, the raw output from Bracken was here considered the predicted species if not otherwise specified. Classification metrics can be derived from the sums of the TP, FP and FN observations for a sample. The metrics applied here for measurement of taxonomic classification performance were True Positive Rate (TPR), Positive Predictive Value (PPV), False Discovery Rate (FDR) and F-score (Fβ). The true positive rate, Equation 3.1, also known as recall, is defined as the number of true species found by the classifier divided by the actual number of species in the sample.

TPR = TP / (TP + FN)    (3.1)

Positive Predictive Value, Equation 3.2, also known as precision, is defined as the proportion of species that the classifier correctly predicts being in the sample over the total number of species that are predicted.

PPV = TP / (TP + FP)    (3.2)

False Discovery Rate, Equation 3.3, is the number of species that the classifier falsely predicts being in the sample divided by the total number of predicted species. The false discovery rate is the complement of the positive predictive value.

FDR = FP / (TP + FP) = 1 − PPV    (3.3)

The Fβ score, Equation 3.4, is a weighted mean of the true positive rate and positive predictive value, where β is the factor of importance of the true positive rate in relation to the positive predictive value.
Fβ = ((1 + β²) · TP) / ((1 + β²) · TP + β² · FN + FP)    (3.4)

The Fβ score for β = 1.5 was mainly used here, meaning that the true positive rate was weighted 1.5 times the positive predictive value. However, since the F1 score is the more commonly used metric for comparing taxonomic classifiers, it was reported in addition to F1.5 [8].

3.4 Experimental Runs

In this project, a sample run refers to two FASTQ files with simulated paired-end reads of a shotgun metagenomic sample being processed through the pipeline together with a metadata file on the sample's theoretical content. Because the focus of the investigation was on human DNA filtering and taxonomic classification, the analysis steps surrounding genome assembly and identification of antibiotic resistance genes were switched off to save time. However, minimal pre-processing in the form of read quality trimming and filtering was applied for each run using the program Fastp (v.0.23.2) [51]. The samples in the ds-human dataset were run with both the Kraken 2 (v.2.1.2) and BWA-mem 2 (v.2.2.1) human read filtering methods. The reads in the samples were first classified with Kraken 2 (confidence parameter = 0.2) and the db-k2-human reference database as either human or non-human. The non-human reads were then passed on to taxonomic classification with Kraken 2 (confidence parameter = 0.075) and Bracken (v.2.8) (read threshold = 750 reads) using the db-standard database. Thereafter all the reads, divided into the categories db-k2-human classified, db-standard classified and db-standard unclassified, were aligned to the human genome using BWA-mem 2 and the db-bwa-human database. The total number of reads aligned per sample and Kraken 2 classification category could then be established.
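Because the ground truth is known at sample level, the metrics of Section 3.3 reduce to set operations on species names followed by Equations 3.1-3.4. A minimal sketch; the species labels are hypothetical and the function is an illustration, not part of the thesis pipeline:

```python
def classification_metrics(predicted: set, truth: set, beta: float = 1.5) -> dict:
    """Sample-level TP/FP/FN counts and the derived TPR, PPV, FDR and
    F-beta score, following Equations 3.1-3.4."""
    tp = len(predicted & truth)   # species predicted and truly present
    fp = len(predicted - truth)   # species predicted but absent
    fn = len(truth - predicted)   # species present but missed
    tpr = tp / (tp + fn)          # recall, Eq. 3.1
    ppv = tp / (tp + fp)          # precision, Eq. 3.2
    fdr = 1.0 - ppv               # Eq. 3.3
    b2 = beta ** 2
    f_beta = (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)  # Eq. 3.4
    return {"TPR": tpr, "PPV": ppv, "FDR": fdr, "F": f_beta}

truth = {"S. aureus", "K. pneumoniae", "E. faecium", "A. baumannii"}
predicted = {"S. aureus", "K. pneumoniae", "E. faecium", "S. epidermidis"}
m = classification_metrics(predicted, truth)  # TP = 3, FP = 1, FN = 1
```

Note that β > 1 makes each missed species (FN) cost more than each false alarm (FP) in the denominator, which is the motivation for using F1.5 in this screening context.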
The three larger datasets, ds-ln-high, ds-ln-low and ds-un-low, were run through the pipeline for taxonomic classification with Kraken 2 and Bracken using the default parameter settings of Kraken 2 (confidence = 0.0) and Bracken (read threshold = 10 reads)¹. The runs were repeated for each of the five bacterial screening databases db-standard, db-bac-comp, db-bac-comp-np, db-bac-rep and db-bac-rep-np. Additionally, even though the datasets theoretically do not include human genomes, they were all run with a prior Kraken 2 human filtration step (confidence parameter = 0.2). The six samples in the ds-param-opt dataset were run repeatedly for Kraken 2 confidences of 0.0-0.2 and Bracken read thresholds of 10-1250 reads with the db-standard reference database. The datasets ds-ln-high, ds-ln-low and ds-un-low were thereafter re-run for all the screening databases with the combination of Kraken 2 confidence (0.075) and Bracken read threshold (750 reads) that gave the highest F1.5 score for ds-param-opt. Furthermore, the datasets ds-ln-high, ds-ln-low and ds-un-low were run again for the five screening databases but with a more conservative parameter value increase of Kraken 2 confidence = 0.05 and Bracken minimum number of reads = 50, compared to the optimized. Lastly, the ds-un-low pathogen dataset was taxonomically classified with the bacterial subset database db-bac-subset using Kraken 2 with confidence = 0.3 and Bracken with a minimum read threshold of 10,000 reads.

¹The default minimum number of reads classified by Kraken 2 at a taxon node for Bracken to redistribute them is set to 0 in the source code but stated as 10 in the documentation (v.2.8) [45]. The threshold of 10 reads is in this project used as the default.
4 Results and Discussion

4.1 Human DNA Filtering

Assembly of the genomes in the metagenomic sample, that is, reconstructing the genomes of the organisms from the read fragments in the sample, is a step needed before the search for antibiotic resistance genes. However, the process of assembling genomes from a metagenomic sample is expensive in terms of computational time and memory, and it is therefore economical not to assemble reads that have a non-bacterial origin, such as human, as they do not carry the resistance genes. Ideally, all human reads and none of the reads of other origins would be removed; in reality, however, that is hard to achieve. The threshold could either be set to miss some human reads or to incorrectly filter out some bacterial reads. Arguably it would be better to let the assembly take slightly longer by removing most but not all human reads, compared to potentially removing bacterial reads carrying important information. The first approach to filtering out human reads was to do it in association with taxonomic classification using Kraken 2 and the db-standard database, as it contains a human reference in addition to microorganisms. The filtration was unsatisfactory, as a substantial proportion of the human reads were classified as microorganisms, mostly fungi, or were unclassified. Adjusting the Kraken 2 confidence parameter for better human filtration would however at the same time impact the classification of bacteria. As the db-standard database was size restricted when building, it was hypothesized that running against an unrestricted K-mer indexation of the human genome, that is, the db-k2-human database, could catch a larger proportion of the human reads.
Admittedly, building a Kraken 2 database with only one species is contrary to the principle of Kraken 2 classification, which relies on the distinct K-mers brought out by the internal uniqueness of K-mers among the genomes used to build the database. To counterbalance having only one species, and thus low variance, in the database, the Kraken 2 confidence parameter was increased from the default of 0.0 to 0.2, which requires more human K-mer matches per read in order for the read to be classified. In addition, applying a human read filtering step separate from the taxonomic classification enables independent parameter settings. An alternative method to K-mer based classification of human reads is direct alignment of reads to the human genome, and it was tested for comparison purposes. The BWA-mem 2 sequence alignment program was chosen because it is a widely used short-read aligner and is among the top performers. BWA-mem was shown to have a lower runtime per read than one of its closest competitors, Bowtie 2, while having similar alignment performance [52]. Thereafter, BWA-mem 2 was released and shown to be faster than BWA-mem [53]. The theoretical composition of each of the 20 samples in the ds-human dataset is 10 % human origin reads and 90 % bacteria. When the samples in the dataset were run with the Kraken 2 database db-k2-human (confidence = 0.2) and then db-standard (confidence = 0.075) on the non-human filtered reads, on average 9.98 % of the reads were classified as human by db-k2-human, 60.52 % were classified by db-standard and 29.50 % were unclassified by db-standard, see Figure 4.1. When all the reads in the samples were then aligned to the human genome with BWA-mem 2, on average 9.33 % of the reads aligned.
Moreover, if the aligned reads are subdivided into the Kraken 2 classification categories, as seen in Table 4.1, then 92.84 % of the Kraken 2 db-k2-human classified reads aligned to the human genome, 0.03 % of the db-standard classified reads aligned and 0.16 % of the db-standard unclassified. As the human reads were not traced on read level, one cannot draw the immediate conclusion that because 9.98 % of the reads in the samples were classified as human by Kraken 2, they truly were human reads. That on average 92.84 % of the Kraken 2 human classified reads in turn also aligned to the human genome using BWA-mem 2 strengthens the indication that the majority of the reads filtered out are truly human. Another indication is that <1 % of the reads classified by Kraken 2 as not human (either unclassified or classified as a microorganism) aligned to the human genome using BWA-mem 2.

(a) Theoretical composition (b) Kraken 2 outer, BWA-mem 2 inner

Figure 4.1: Average percent of reads filtered out as human in the ds-human dataset, where (a) each sample, independent of size, has a theoretical composition of 10 % simulated human reads and 90 % bacterial reads. The outer doughnut in (b) represents the average proportions of reads that are filtered out by db-k2-human and classified or unclassified by db-standard. The inner doughnut represents the reads that do or do not align to the human genome using BWA-mem 2, in relation to the outer doughnut. It can be seen that the human aligned reads are concentrated among the Kraken 2 human classified reads.
Sample group  % K2 human filter of total  % BWA align to human of total  % BWA align of K2 human filter  % BWA align of K2 classified  % BWA align of K2 unclassified
s20-c10-h10   9.98 [9.97,9.99]            9.68 [9.53,9.78]               96.34 [95.04,97.44]             0.03 [0.02,0.03]              0.18 [0.09,0.38]
s20-c20-h10   9.97 [9.97,9.98]            9.37 [8.91,9.56]               93.21 [88.73,95.28]             0.03 [0.02,0.05]              0.18 [0.13,0.25]
s40-c10-h10   9.97 [9.97,9.98]            9.35 [9.13,9.56]               93.17 [91.10,95.34]             0.03 [0.02,0.03]              0.14 [0.10,0.20]
s40-c20-h10   9.98 [9.97,9.98]            8.91 [8.74,9.13]               88.66 [87.00,90.95]             0.03 [0.02,0.04]              0.14 [0.12,0.16]
Full dataset  9.98 [9.97,9.99]            9.33 [8.74,9.78]               92.84 [87.00,97.44]             0.03 [0.02,0.05]              0.16 [0.09,0.38]

Table 4.1: Average [min, max] percentages of reads classified by Kraken 2 (K2) and BWA-mem 2 (BWA), divided by sample group (5 replicates in each group) of the ds-human dataset. The theoretical composition of the samples is 10 % human and 90 % bacteria.

The percentages of reads filtered with Kraken 2 and BWA-mem 2 are consistent across samples with differing numbers of reads and numbers of other bacterial species, as can be seen across the sample groups in Table 4.1. The result is expected because the classification and alignment are independent for each read, and the presence of more or fewer other bacterial reads should not affect the human reads. The parameter settings for Kraken 2 and BWA-mem 2 were not optimized for human filtration, and it is likely that both approaches could be equally good at filtering if a rigorous parameter optimization were performed. Kraken 2, however, strongly outperformed BWA-mem 2 in speed, as it was on average 94 times faster over the 20 samples run. Can the results be applied to real world samples from the human microbiome? The main limitation is that the human reads in the test dataset were simulated and of a relatively high sequencing quality, which would be expected to vary in real samples.
Furthermore, when simulating the human reads, the coverage was uniformly distributed across the genome, which is unlikely for a real world human genome. Also noteworthy is that the Kraken 2 confidence was set for classification with the db-k2-human database for separating human reads from bacterial reads; hence it is questionable whether the high performance holds for filtering human reads from more similar types of organisms, such as other eukaryotes. In a broader perspective, if a similar pipeline were to be used, for example, in a veterinary setting, then the genomes of the animals in question would be used to build the filtration database. This is relevant since the problem of antibiotic resistance is present in agriculture too. It could possibly be easier to construct a Kraken 2 database with the animal species in question than to develop a physical host DNA extraction kit to apply pre-sequencing.

4.2 Classifier, Parameters and Thresholds

Kraken 2 is known for its speed and high recall for microbial classification, but also for its low precision [1][43]. These features were observed in this project, where false discovery rates higher than 90 % were common for default Kraken 2 and Bracken parameter settings, which meant that a majority of the species identified by the classifier were false positives. What an acceptable false discovery rate is would depend on the specific intended application of the classification results; however, a rate of >90 % is likely unacceptable for most clinical screening purposes. Implementing measures to control the false discovery rate would generally also affect the true positive rate. The goal was therefore to find an approach that reduced the false discovery rate as much as possible without severely reducing the true positive rate. The main classification metric used for harmonizing the two was the F1.5 score.
The F1.5 score, where the true positive rate is weighted higher than the positive predictive value, was chosen over the more common F1 score because it was considered more important not to miss potentially dangerous pathogens than to avoid false alarms. The intention was for the taxonomic classification step to be part of a fast screening pipeline; if bacterial species of interest were found, then further tests with higher precision could possibly be done for increased certainty. There are several approaches to the problem of lowering the false discovery rate. For example, adjustments can be applied to parameters in the Kraken 2 read classification, to which reads are considered for abundance re-estimation by Bracken, and by setting post classification filtering thresholds of minimum estimated community abundance or minimum number of estimated reads assigned to a taxon for it to be considered found. Taking it further, the pre-processing steps before taxonomic classification, such as length and quality trimming, could also play a role, but were not considered within the project scope. Tuning of Kraken 2 and Bracken parameters was first considered.

4.2.1 Kraken 2 and Bracken Parameters

The parameters considered were the read classification confidence for Kraken 2 and the threshold for the minimum number of classified reads required for abundance re-estimation by Bracken. Other Kraken 2 parameters, such as K-mer length or minimizer length, are also relevant for classification performance, as was shown in the Kraken 2 method article [11]. These parameters are however tied to the Kraken 2/Bracken database construction, which in turn can be cumbersome to rebuild for a variety of settings with limited computational resources. The confidence and read threshold were on the other hand adjustable for each sample run without re-building the reference databases.
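The effect of the confidence parameter can be illustrated with a simplified model of Kraken 2's confidence scoring: a read is only kept as classified if the fraction of its K-mers supporting the assigned clade reaches the threshold. This is a sketch of the idea, not Kraken 2's actual implementation (which scores over the taxonomy tree and handles ambiguous K-mers); the read length and k value are illustrative:

```python
def passes_confidence(clade_kmer_hits: int, total_kmers: int,
                      confidence: float) -> bool:
    """Simplified Kraken 2-style confidence check: keep the read as
    classified only if enough of its K-mers support the assigned clade."""
    return clade_kmer_hits / total_kmers >= confidence

# A 150 bp read with k = 35 contains 150 - 35 + 1 = 116 K-mers.
total = 116
# At the default confidence 0.0 a single hit suffices; at 0.2 at least
# 24 of the 116 K-mers must support the clade.
assert passes_confidence(1, total, 0.0)
assert not passes_confidence(23, total, 0.2)   # 23/116 ≈ 0.198 < 0.2
assert passes_confidence(24, total, 0.2)       # 24/116 ≈ 0.207 ≥ 0.2
```

The per-read nature of the check is why the confidence can be tuned independently of the number of species or reads in a sample, in contrast to the Bracken read threshold.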
Ideally, a dataset with a large variety in sample complexity, distribution and species diversity would have been used for finding the optimal parameter settings for a general metagenomic sample or a sub-group of samples. However, due to restrictions in time and sample simulation abilities, the number of samples in the parameter optimization dataset ds-param-opt was held low and with fewer species. It is important to note that the six sample dataset ds-param-opt used for the optimization was not a representation of all possible shotgun metagenomic samples; it contains samples with lower species diversity than, for example, ds-ln-high, and was simulated from a pool of only 390 species. The motivation behind performing an optimization was not to find universally optimal metagenomic parameter settings, but rather to demonstrate the effects of varied parameter settings for the same dataset in relation to the default. The samples in the ds-param-opt dataset were taxonomically classified using the db-standard database for a range of Kraken 2 confidence settings and thresholds for the Bracken minimum number of reads required for read redistribution. In Figure 4.2 the average positive predictive value, true positive rate, F1 and F1.5 scores for the samples in the dataset are shown for Kraken 2 confidences [0.0, 0.2] and Bracken read thresholds [10, 1250]. The results are only in relation to bacteria; possible false positives from other domains were not included. For default settings, the dataset average true positive rate was 100 %, which means that all species in the samples were identified. The positive predictive value was however only 4 %, which means that 96 % of the species predicted by the classifier were false positives. The resulting average F1.5 score was 12 %. Increasing either the confidence or the read threshold increased the positive predictive value and decreased the true positive rate.
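The parameter sweep just described amounts to a small grid search over the two parameters, keeping the setting with the highest dataset-average F1.5. A sketch of that selection step; the scoring function here is a toy placeholder (in the project, each grid point required re-classifying all six samples through the pipeline):

```python
import itertools

confidences = [0.0, 0.025, 0.05, 0.075, 0.1, 0.15, 0.2]
read_thresholds = [10, 50, 250, 750, 1250]

def mean_f15(conf: float, thresh: int) -> float:
    """Placeholder for the dataset-average F1.5 of one pipeline run.
    Toy surface peaking at (0.075, 750), for illustration only."""
    return 1.0 - abs(conf - 0.075) * 4 - abs(thresh - 750) / 2000

# Exhaustively evaluate the grid and keep the best-scoring setting.
best = max(itertools.product(confidences, read_thresholds),
           key=lambda p: mean_f15(*p))
# best == (0.075, 750) for this toy surface
```

Since only 7 × 5 = 35 grid points are involved, exhaustive evaluation is feasible; the expensive part in practice is the repeated classification runs, not the search itself.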
Applying both a high confidence (0.2) and a high read threshold (1250 reads) instead gave an average positive predictive value of 100 %, however at the cost of decreasing the true positive rate to 21 %, resulting in an F1.5 score of 28 %. The F1.5 optimized parameter settings were a Kraken 2 confidence of 0.075 and a Bracken read threshold of 750 reads, which gave an average F1.5 score of 80 %. It was thus clear that the default parameter settings were not optimal for the ds-param-opt dataset. A reason why false positive observations occur, aside from database misannotation, is that a portion of the reads from a species truly in the sample have a higher K-mer match to other closely related species. If a species truly in the sample is in high abundance, then the proportion of reads originating from it that are misclassified can appear significant even if the classification error rate is low on read level. Furthermore, the misclassifications have been shown to be systematic, where finding a certain species in high abundance is correlated to also falsely finding specific species in low abundance [54]. The systematic error was more prevalent in Kraken 2 compared to marker gene type taxonomic classifiers such as MetaPhlAn [54]. That there are systematic errors is however not surprising, since the species have varying levels of phylogenetic relatedness. The problem of systematic misclassifications could then worsen if the abundances of the false positives are artificially inflated by Bracken's redistribution of reads from higher taxonomic levels to lower levels. A threshold for Bracken abundance re-estimation in the form of a minimum fraction of reads classified by Kraken 2 as a specific taxon could be more scalable than using an absolute number of reads.
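Bracken's redistribution can be illustrated in simplified form: reads that Kraken 2 left at a higher rank (e.g. genus) are pushed down to species in proportion to the species-level read counts, after which a minimum-read threshold is applied. The sketch below is only that simplified idea, not Bracken's actual Bayesian re-estimation (which uses database-derived K-mer probabilities); the counts and species are illustrative:

```python
def redistribute(species_reads: dict, genus_reads: int,
                 threshold: int = 750) -> dict:
    """Push genus-level reads down to species proportionally to the
    species-level counts, then apply the minimum-read threshold."""
    total = sum(species_reads.values())
    est = {sp: n + genus_reads * n / total
           for sp, n in species_reads.items()}
    return {sp: round(n) for sp, n in est.items() if n >= threshold}

# 10,000 reads stuck at genus level, three candidate species:
counts = {"S. aureus": 8_000, "S. epidermidis": 1_500, "S. hominis": 500}
est = redistribute(counts, genus_reads=10_000)
# Every species receives extra reads in proportion to its own count,
# so a false positive with a small initial count is inflated further.
```

This makes concrete why redistribution can amplify low-abundance false positives: any species with a nonzero Kraken 2 count receives a proportional share of the higher-rank reads, which can push it past a low threshold.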
If a choice had to be made between adjusting Kraken 2 or Bracken, arguably the Kraken 2 confidence should be increased from default, because its effects are on read level and are independent of the number of species or the number of reads in the sample.

(a) Mean PPV (precision) (b) Mean TPR (recall) (c) Mean F1 (d) Mean F1.5

Figure 4.2: Average species level classification metrics for samples in the dataset ds-param-opt, re-run for Kraken 2 confidence parameter settings between 0.0-0.2 on the x-axis and a Bracken read threshold between 10 and 1250 on the y-axis. The reference database used was db-standard. The highest F1.5 score was observed for a Kraken 2 confidence of 0.075 in combination with a Bracken minimum number read threshold of 750 reads.

The samples in the larger ds-ln-high, ds-ln-low and ds-un-low datasets were taxonomically classified with the Kraken 2 and Bracken combination using different databases for the default parameter settings and then re-run for 'optimized' parameter settings (confidence = 0.075, read threshold = 750). In Table 4.2 the resulting dataset averages of species F1.5, true positive rates and false discovery rates are listed for five databases, while Figure 4.3 graphically shows trends only for the baseline db-standard database.¹ For default Kraken 2/Bracken parameters and the db-standard database, the average F1.5 scores were 27.1 %, 10.6 % and 10.6 % for the ds-ln-high, ds-ln-low and ds-un-low datasets respectively. When instead optimized parameters were used for classification, the average scores increased to 53.1 %, 77.4 % and 82.2 % for the three datasets. That the relative improvement was larger for the ds-ln-low and ds-un-low datasets, compared to ds-ln-high, was not surprising because they were more similar to the dataset used for F1.5 optimization in terms of species diversity, sequencing depth and program used for simulation.
Furthermore, the datasets average false discovery rates decreased from 88.0 %, 96.5 % and 96.5 % for the de- fault parameters to 10.3 %, 24.2 % and 18.3 % for the optimized. In contrast, the largest decrease was observed for the ds-ln-high, however, the true positive rate also decreased the most from 76.4 % to 46.0 %. A probable reason for both lower true positive rates and false discovery rates for the ds-ln-high is that the samples have a higher proportion of species with low genome sequence coverage (less than the Bracken 750 reads threshold) compared to the ds-ln-low and ds-un-low. As the exact same sample datasets were used in each experimental runs, the averages of classification performance metrics can be compared between different databases and parameter settings, however it is important to note that the spread across sam- ples within a dataset can be large, as can be observed in Table 4.2 where approxi- mately 10-20 percentage points differ between the 25 % and 75 % percentiles. Also noteworthy is that the average classification performance of samples in a dataset would not be relevant when for example screening a patient and a confidence inter- val on performance for a certain type of sample would be more applicable. For a future unknown metagenomic sample, with possibly a low sequencing depth, using a Kraken 2 confidence 0.075 and Bracken read threshold of 750 reads which was optimal for the ds-param-opt dataset could be too aggressive. As an example take the the sharp decrease in true positive rate for the ds-ln-high. For the ds-param- opt, the mean F1.5 increased steeply for increased Bracken read thresholds between 10 and 250 reads (not shown in the results) and it was seen that a small change in parameter values could make a difference. It was therefore hypothesized that a smaller parameter adjustment could increase the F1.5 score while being ’safe’ for more diverse types of sample compositions. 
The datasets were re-run again for parameter settings of Kraken 2 confidence 0.05 and Bracken read threshold of 50 reads. The results were that ds-ln-high performed better while ds-ln-low and ds-un-low performed worse compared to classification with the 'optimal' parameter settings. Classification with the conservative parameter adjustment did however perform better for F1.5 than with default settings for all the datasets.

¹Additional F1 scores are listed in Appendix B together with the genus level classification performance for each combination of parameter setting, dataset and database.

(a) False discovery rate (b) True positive rate (c) F1.5 score (d) F1 score

Figure 4.3: Mean [25 percentile, 75 percentile] classification metrics for Kraken 2 confidence and Bracken minimum number of reads threshold parameter settings of default (conf. = 0.00, min reads = 10), optimized (conf. = 0.075, min reads = 750) and light (conf. = 0.05, min reads = 50) across datasets ds-ln-high, ds-ln-low and ds-un-low run with the db-standard reference database. (a) false discovery rates (FDR), (b) true positive rates (TPR), (c) F1.5 scores and (d) F1 scores.

The conclusion is that the default Kraken 2 confidence and Bracken read threshold may be far from optimal. The optimal parameter settings can vary depending on sample type, and for future implementation it could be good to optimize specifically for the range of sequence coverage, species number and diversity relevant for the use case. For example, if the pipeline were to be used for identifying bacteria in skin swabs, samples from a skin microbiome could be simulated and used for parameter optimization. A conservative threshold increase from default to a confidence of 0.05 and a read threshold of 50 reads could be a compromise for unknown samples.

Default (conf. 0.0, min reads 10)

Dataset     Database        F1.5 (%)           TPR (%)              FDR (%)
ds-ln-high  db-standard     27.1 [14.6,40.0]   76.3 [71.5,89.6]     88.0 [81.8,94.8]
            db-bac-comp     25.3 [13.1,37.3]   75.1 [69.7,89.1]     89.0 [83.6,95.5]
            db-bac-comp-np  24.2 [12.6,35.6]   75.1 [69.7,89.1]     89.7 [84.6,95.7]
            db-bac-rep      24.5 [14.2,37.6]   91.5 [88.5,97.8]     90.0 [83.7,95.1]
            db-bac-rep-np   24.0 [13.9,36.8]   91.5 [88.5,97.9]     90.3 [84.1,95.2]
ds-ln-low   db-standard     10.6 [8.7,11.4]    98.7 [98.4,100.0]    96.5 [96.2,97.2]
            db-bac-comp     10.5 [8.6,11.4]    99.5 [100.0,100.0]   96.5 [96.2,97.2]
            db-bac-comp-np  9.5 [7.8,10.2]     99.5 [100.0,100.0]   96.9 [96.6,97.5]
            db-bac-rep      5.0 [4.0,5.7]      93.4 [90.0,97.8]     98.4 [98.2,98.7]
            db-bac-rep-np   4.8 [3.9,5.6]      93.4 [90.0,97.8]     98.4 [98.2,98.8]
ds-un-low   db-standard     10.6 [8.0,11.8]    99.5 [100.0,100.0]   96.5 [96.1,97.4]
            db-bac-comp     10.3 [8.2,11.7]    100.0 [100.0,100.0]  96.6 [96.1,97.3]
            db-bac-comp-np  9.3 [7.1,10.9]     100.0 [100.0,100.0]  96.9 [96.4,97.7]
            db-bac-rep      5.1 [3.9,5.9]      95.2 [90.0,100.0]    98.4 [98.1,98.8]
            db-bac-rep-np   4.9 [3.8,5.6]      95.2 [90.0,100.0]    98.4 [98.2,98.8]

Optimal (conf. 0.075, min reads 750)

ds-ln-high  db-standard     53.1