DNA Sequence Classification Using Variable Length Markov Models

Type
Master's thesis
Published
2020
Author
Norlin, Sebastian
Abstract
Pathogens such as viruses and bacteria are a major health concern today. To treat them effectively, it is important to identify both known pathogens and potential new ones from DNA samples. However, modern methods are not good enough at classifying rare, previously undocumented pathogens. This thesis explores nearest neighbor classification using variable length Markov chains (VLMCs) as a possible solution. A vantage point tree stores the database of VLMCs that is queried against. This gives promising results when classifying VLMCs built from complete genomes or chromosomes. Multiple techniques are explored, including both greedy approximations and new lower bounds. These result in classification that is an order of magnitude faster than in previous research. However, the technique ultimately fails at classifying shorter DNA sequences of the lengths typically produced when sequencing DNA. Multiple reasons for this are given, along with a possible way forward if further research is deemed relevant.
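To illustrate the general idea of nearest neighbor classification with a vantage point tree, the following is a minimal sketch and not the thesis implementation. It assumes each stored model can be compared with a distance function that satisfies the triangle inequality. Here a plain Euclidean distance over short feature vectors stands in for a distance between VLMCs, and all names (`VPNode`, `build_vp_tree`, `nearest`, `classify`) are hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# (label, feature vector): a hypothetical stand-in for a labeled, stored model.
Item = Tuple[str, Tuple[float, ...]]
Dist = Callable[[Item, Item], float]


@dataclass
class VPNode:
    item: Item                      # the vantage point stored at this node
    radius: float = 0.0             # median distance from the vantage point
    inside: Optional["VPNode"] = None   # items with distance < radius
    outside: Optional["VPNode"] = None  # items with distance >= radius


def euclidean(a: Item, b: Item) -> float:
    # Assumption: any true metric could be substituted here.
    return math.dist(a[1], b[1])


def build_vp_tree(items: List[Item], dist: Dist, rng: random.Random) -> Optional[VPNode]:
    if not items:
        return None
    rest = list(items)
    vantage = rest.pop(rng.randrange(len(rest)))  # pick a random vantage point
    if not rest:
        return VPNode(vantage)
    ds = [dist(vantage, p) for p in rest]
    radius = sorted(ds)[len(ds) // 2]             # split at the median distance
    inside = [p for p, d in zip(rest, ds) if d < radius]
    outside = [p for p, d in zip(rest, ds) if d >= radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist, rng),
                  build_vp_tree(outside, dist, rng))


def nearest(node: Optional[VPNode], query: Item, dist: Dist,
            best: Tuple[Optional[Item], float] = (None, math.inf)):
    if node is None:
        return best
    d = dist(query, node.item)
    if d < best[1]:
        best = (node.item, d)
    if d < node.radius:
        # Query lies inside the ball: search inside first.
        best = nearest(node.inside, query, dist, best)
        # Triangle inequality: outside items are at least radius - d away.
        if d + best[1] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        # Triangle inequality: inside items are more than d - radius away.
        if d - best[1] < node.radius:
            best = nearest(node.inside, query, dist, best)
    return best


def classify(tree: Optional[VPNode], query_vec: Tuple[float, ...],
             dist: Dist = euclidean) -> str:
    item, _ = nearest(tree, ("?", query_vec), dist)
    return item[0]  # the label of the nearest stored item


database = [("virus", (0.1, 0.9)), ("bacterium", (0.8, 0.2)),
            ("virus", (0.2, 0.8)), ("bacterium", (0.9, 0.1))]
tree = build_vp_tree(database, euclidean, random.Random(0))
print(classify(tree, (0.15, 0.85)))
```

The pruning tests rely only on the triangle inequality, which is why a tree of this kind requires the distance to be a metric. Subtrees are skipped only when no item in them can be closer than the best match found so far, so the search returns the same result as a linear scan.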
Subject/keywords
Computer science, Bioinformatics, Master’s thesis, vantage point tree, metric space, Variable length Markov chains, Markov Models, DNA, Classification