DNA Sequence Classification Using Variable Length Markov Models
Ladda ner
Publicerad
Författare
Typ
Examensarbete för masterexamen
Program
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
Pathogens such as viruses and bacteria are a major health concern today. To effectively
treat these it is important to identify known pathogens and potential new ones
from DNA samples. Modern methods are however not good enough at classifying
rare, previously undocumented pathogens.
This thesis explores nearest neighbor classification using variable length Markov
chains (VLMC) as a possible solution. A vantage point tree is used to store the
database of VLMC being queried against. This gives promising results when classifying
VLMC from complete genomes or chromosomes. Multiple techniques, both
greedy approximations and new lower bounds are explored. This results in order of
magnitude faster classification than previous research. However the technique ultimately
fails at classifying shorter DNA sequences of lengths typically found when
sequencing DNA. Multiple reasons for this are given with a possible way forward if
further research is deemed relevant.
Beskrivning
Ämne/nyckelord
Computer science, Bioinformatics, Master’s thesis, vantage point tree, metric space, Variable length Markov chains, Markov Models, DNA, Classification