Efficient distance computations of VLMCs in DNA sequencing

dc.contributor.authorHolm, Sebastian
dc.contributor.authorAtterfors, Johan
dc.contributor.departmentChalmers tekniska högskola / Institutionen för data och informationstekniksv
dc.contributor.departmentChalmers University of Technology / Department of Computer Science and Engineeringen
dc.contributor.examinerSchliep, Alexander
dc.contributor.supervisorGustafsson, Joel
dc.date.accessioned2023-06-21T12:07:23Z
dc.date.available2023-06-21T12:07:23Z
dc.date.issued2023
dc.date.submitted2023
dc.description.abstractClassifying species is a crucial part of efforts to stop emerging pathogens e.g. viruses and dangerous diseases. As DNA sequences can be many gigabytes large and the size of current DNA databases is enormous and growing fast, the classification of species is a computationally expensive task. One type of model for DNA sequences is the Variable-Length Markov Chain (VLMC) which has been shown to capture the essential aspects of sequences while reducing memory footprint. A promising efficient method of comparing VLMCs is the d v measure developed by Gustafsson et al. where the main bottleneck is finding the intersection between two VLMCs. This has been shown to currently be cache inefficient. In this thesis, we reduce the execution time and memory access of the d∗ v dissimilarity function, a problem that involves quickly accessing data. We investigated cacheoblivious and other data structures that were adapted to store VLMCs for efficient dissimilarity computation. These structures include the van Emde Boas layout, Eytzinger layout, B-tree layout, Sorted vector, Hash map, and a sorted vector with an index structure we call Sorted Block Search (SBS). We further move from a nested for-loop to utilizing a cache-oblivious matrix transpose algorithm, which we call matrix recursion when constructing a matrix of distances between VLMCs. The data structure, Hash map, attains a speedup of up to 19 times compared to the state-of-the-art when computing all distances between 15 446 VLMCs created from the NCBI Virus Ref Seq database. For larger VLMCs of 10-50 MB, one of the fastest data structures, Sorted vector, attains a speedup of up to 15 times. The new algorithm is benchmarked on a laptop, desktop and one compute cluster. The presented improvements can be utilized in pathogen classification and evolutionary genomics as the new algorithm show good scaling for compute clusters, making these optimizations promising.
dc.identifier.urihttp://hdl.handle.net/20.500.12380/306360
dc.setspec.uppsokTechnology
dc.subjectcomputer science
dc.subjectMarkov Chain
dc.subjectVariable Length Markov Chain
dc.subjectVLMC
dc.subjectK-mer
dc.subjectalgorithm engineering
dc.subjectoblivious
dc.subjectbioinformatics
dc.titleEfficient distance computations of VLMCs in DNA sequencing
dc.type.degreeExamensarbete för masterexamensv
dc.type.degreeMaster's Thesisen
dc.type.uppsokH
local.programmeComputer science – algorithms, languages and logic (MPALG), MSc
Ladda ner
Original bundle
Visar 1 - 1 av 1
Hämtar...
Bild (thumbnail)
Namn:
Thesis-final.pdf
Storlek:
1.18 MB
Format:
Adobe Portable Document Format
Beskrivning:
License bundle
Visar 1 - 1 av 1
Hämtar...
Bild (thumbnail)
Namn:
license.txt
Storlek:
2.35 KB
Format:
Item-specific license agreed upon to submission
Beskrivning: