Efficient distance computations of VLMCs in DNA sequencing

Holm, Sebastian; Atterfors, Johan

Efficient distance computations of VLMCs in DNA sequencing

dc.contributor.author	Holm, Sebastian
dc.contributor.author	Atterfors, Johan
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Schliep, Alexander
dc.contributor.supervisor	Gustafsson, Joel
dc.date.accessioned	2023-06-21T12:07:23Z
dc.date.available	2023-06-21T12:07:23Z
dc.date.issued	2023
dc.date.submitted	2023
dc.description.abstract	Classifying species is a crucial part of efforts to stop emerging pathogens e.g. viruses and dangerous diseases. As DNA sequences can be many gigabytes large and the size of current DNA databases is enormous and growing fast, the classification of species is a computationally expensive task. One type of model for DNA sequences is the Variable-Length Markov Chain (VLMC) which has been shown to capture the essential aspects of sequences while reducing memory footprint. A promising efficient method of comparing VLMCs is the d v measure developed by Gustafsson et al. where the main bottleneck is finding the intersection between two VLMCs. This has been shown to currently be cache inefficient. In this thesis, we reduce the execution time and memory access of the d∗ v dissimilarity function, a problem that involves quickly accessing data. We investigated cacheoblivious and other data structures that were adapted to store VLMCs for efficient dissimilarity computation. These structures include the van Emde Boas layout, Eytzinger layout, B-tree layout, Sorted vector, Hash map, and a sorted vector with an index structure we call Sorted Block Search (SBS). We further move from a nested for-loop to utilizing a cache-oblivious matrix transpose algorithm, which we call matrix recursion when constructing a matrix of distances between VLMCs. The data structure, Hash map, attains a speedup of up to 19 times compared to the state-of-the-art when computing all distances between 15 446 VLMCs created from the NCBI Virus Ref Seq database. For larger VLMCs of 10-50 MB, one of the fastest data structures, Sorted vector, attains a speedup of up to 15 times. The new algorithm is benchmarked on a laptop, desktop and one compute cluster. The presented improvements can be utilized in pathogen classification and evolutionary genomics as the new algorithm show good scaling for compute clusters, making these optimizations promising.
dc.identifier.uri	http://hdl.handle.net/20.500.12380/306360
dc.setspec.uppsok	Technology
dc.subject	computer science
dc.subject	Markov Chain
dc.subject	Variable Length Markov Chain
dc.subject	VLMC
dc.subject	K-mer
dc.subject	algorithm engineering
dc.subject	oblivious
dc.subject	bioinformatics
dc.title	Efficient distance computations of VLMCs in DNA sequencing
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	Computer science – algorithms, languages and logic (MPALG), MSc

Ladda ner

Original bundle

Visar 1 - 1 av 1

Namn:: Thesis-final.pdf
Storlek:: 1.18 MB
Format:: Adobe Portable Document Format

Ladda ner

License bundle

Visar 1 - 1 av 1

Namn:: license.txt
Storlek:: 2.35 KB
Format:: Item-specific license agreed upon to submission
Beskrivning:

Ladda ner

Samlingar

Examensarbeten för masterexamen