Efficient distance computations of VLMCs in DNA sequencing

Holm, Sebastian; Atterfors, Johan

Efficient distance computations of VLMCs in DNA sequencing

Ladda ner

Thesis-final.pdf (1.18 MB)

Publicerad

2023

Författare

Holm, Sebastian

Atterfors, Johan

Typ

Examensarbete för masterexamen
Master's Thesis

Program

Computer science – algorithms, languages and logic (MPALG), MSc

Sammanfattning

Classifying species is a crucial part of efforts to stop emerging pathogens e.g. viruses and dangerous diseases. As DNA sequences can be many gigabytes large and the size of current DNA databases is enormous and growing fast, the classification of species is a computationally expensive task. One type of model for DNA sequences is the Variable-Length Markov Chain (VLMC) which has been shown to capture the essential aspects of sequences while reducing memory footprint. A promising efficient method of comparing VLMCs is the d v measure developed by Gustafsson et al. where the main bottleneck is finding the intersection between two VLMCs. This has been shown to currently be cache inefficient. In this thesis, we reduce the execution time and memory access of the d∗ v dissimilarity function, a problem that involves quickly accessing data. We investigated cacheoblivious and other data structures that were adapted to store VLMCs for efficient dissimilarity computation. These structures include the van Emde Boas layout, Eytzinger layout, B-tree layout, Sorted vector, Hash map, and a sorted vector with an index structure we call Sorted Block Search (SBS). We further move from a nested for-loop to utilizing a cache-oblivious matrix transpose algorithm, which we call matrix recursion when constructing a matrix of distances between VLMCs. The data structure, Hash map, attains a speedup of up to 19 times compared to the state-of-the-art when computing all distances between 15 446 VLMCs created from the NCBI Virus Ref Seq database. For larger VLMCs of 10-50 MB, one of the fastest data structures, Sorted vector, attains a speedup of up to 15 times. The new algorithm is benchmarked on a laptop, desktop and one compute cluster. The presented improvements can be utilized in pathogen classification and evolutionary genomics as the new algorithm show good scaling for compute clusters, making these optimizations promising.

Ämne/nyckelord

computer science, Markov Chain, Variable Length Markov Chain, VLMC, K-mer, algorithm engineering, oblivious, bioinformatics

URI

https://hdl.handle.net/20.500.12380/306360

Samlingar

Examensarbeten för masterexamen

Visa fullständig post

Efficient distance computations of VLMCs in DNA sequencing

Ladda ner

Publicerad

Författare

Typ

Program

Modellbyggare

Tidskriftstitel

ISSN

Volymtitel

Utgivare

Sammanfattning

Beskrivning

Ämne/nyckelord

Citation

Arkitekt (konstruktör)

Geografisk plats

Byggnad (typ)

Byggår

Modelltyp

Skala

Teknik / material

Index

URI

Samlingar

Endorsement

Review

Supplemented By

Referenced By